ETL with Talend and Greenplum - Part two: data import

September 19, 2011

In the first part of this tutorial, we set up all the connections required for creating the job. Now we can proceed with the data import.

Let’s drag and drop an object named tMap into the visual editor. You can find it on the left, in the component palette, inside the “elaboration” folder.

Now we need to connect the “states” CSV object to the tMap element (right-click on the CSV element -> rows -> main), then connect the tMap element to the destination table (right-click -> rows -> new output). Once the three elements are connected, we need to open the tMap object in order to edit the field associations.
By dragging the fields from the table on the left to the one on the right, you can associate every field of the CSV with its destination column in the target database table. Clicking “Ok” saves the field associations, and the data from the CSV file is ready to be imported.
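Conceptually, the association made in tMap is a rename from CSV fields to destination columns. The following Python sketch shows the idea outside Talend; the destination column names (`id_state`, `state_name`), the `name` field, and the sample CSV content are assumptions made for illustration, not taken from the actual job.

```python
import csv
import io

# Hypothetical mapping: CSV field name -> destination column name.
FIELD_MAP = {"idState": "id_state", "name": "state_name"}

def map_row(csv_row, field_map=FIELD_MAP):
    """Rename the mapped CSV fields to their destination column names."""
    return {dest: csv_row[src] for src, dest in field_map.items()}

# Made-up sample data standing in for the "states" CSV file.
sample = io.StringIO("idState,name\n1,Alabama\n2,Alaska\n")
mapped_rows = [map_row(row) for row in csv.DictReader(sample)]
```

Each element of `mapped_rows` is a dictionary keyed by destination column, which is roughly what the output side of the tMap hands to the target table.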
Now it is time to add the import for the users table to the job.
It’s important to remember that the users table has a many-to-one relationship with the states table (one state can have many users, but each user can have only one state), so during the import we need to perform a lookup on the states table to maintain the relationship between the two tables. By “lookup” we mean searching for a value inside a dictionary, using a key. The goal is to retrieve an ID (usually the primary key of an object in the database) to be stored on the “many” side of the relationship, preserving referential integrity.
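As a minimal illustration of what a lookup means here, the sketch below builds a dictionary from a few state rows and retrieves an ID by key. The rows, the choice of the state name as the search key, and the function name are all hypothetical.

```python
# Hypothetical rows as they might be read back from the "states" table:
# (primary key, state name). Values and layout are assumptions.
state_rows = [(1, "Alabama"), (2, "Alaska")]

# Build the dictionary once: key to search by -> ID to retrieve.
state_id_by_name = {name: state_id for state_id, name in state_rows}

def lookup_state_id(name):
    """Return the ID stored for this state name, or None when no state matches."""
    return state_id_by_name.get(name)

found = lookup_state_id("Alaska")
missing = lookup_state_id("Atlantis")
```

A `None` result signals a row that would break referential integrity if it were written as is.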
Add a tMap object between the CSV file containing the users and the destination table. Then, from the list of tables in the database, drag a “states” table object into the visual editor, using the tGreenplumInput type.
Connect the three elements to the tMap element, as before.
Now it’s time to map the elements. This time the idState field from the CSV will be mapped to the idState column of the “states” table (called row3 in the example image), and that column will in turn be mapped to the “users” table. As a result, every row of the CSV file triggers a check against the states table, and the correct ID is then assigned to the destination row in the “users” table.
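The row-by-row check can be pictured with the following Python sketch. It is not what Talend generates: the `name` field, the sample values, and the decision to set unmatched rows aside are assumptions. In the real job, what happens to unmatched rows depends on how the join is configured in the tMap.

```python
def join_users_with_states(user_rows, known_state_ids):
    """Split user rows by whether their idState exists in the states table."""
    matched, rejected = [], []
    for user in user_rows:
        if user["idState"] in known_state_ids:
            matched.append(user)   # safe to write to "users"
        else:
            rejected.append(user)  # would violate referential integrity
    return matched, rejected

# Hypothetical data: IDs already present in "states", and two CSV user rows.
known_state_ids = {1, 2}
users = [{"name": "alice", "idState": 1}, {"name": "bob", "idState": 99}]
matched, rejected = join_users_with_states(users, known_state_ids)
```

Only the rows in `matched` carry an idState that is guaranteed to exist in the states table.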
As a final step, we need to add a conditional link between the two subjobs (the states import and the users import, lookup included). This is because the second subjob can succeed only if the first one has completely imported the data into the states table.
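The dependency between the two subjobs amounts to "run the second step only if the first one finished without errors". A rough Python analogy, with hypothetical function names standing in for the two subjobs:

```python
def run_job(import_states, import_users):
    """Run the users import only if the states import completed without raising."""
    try:
        import_states()
    except Exception as exc:
        # The first step failed, so the dependent step is never started.
        return f"states import failed: {exc}"
    import_users()
    return "ok"

# Hypothetical stand-ins for the two subjobs; they only record that they ran.
calls = []

def import_states():
    calls.append("states")

def import_users():
    calls.append("users")

def failing_import_states():
    raise RuntimeError("connection lost")

result_ok = run_job(import_states, import_users)
result_failed = run_job(failing_import_states, import_users)
```

In the second call the users import is skipped entirely, which is the behaviour the conditional link is meant to guarantee.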
To create this link, right-click on the output table “states”, select the “Trigger” option, and then “onComponentOk”. Link the “states” component to the users input CSV file. You will get something like this:
The job is ready to be executed. Simply find the “Run” tab in the lower part of the screen and click the run button. The import will be executed and the data loaded into the database.
Even though the operations shown in this two-part article are quite simple, you can repeat them for all the tables (dimensions and facts) of your data warehouse.
For more information, do not hesitate to contact us. In the coming weeks you will see more articles about Greenplum and ETL tools on our blog.