ETL with Talend and Greenplum - Part one: connections

September 16, 2011
hen working with databases, one of the most common task is to load data from one or more CSV files.
Several tools are available to achieve this task. Some are executed via command line, like COPY (using psql), some are more complex, like ETL systems.
We will start today with Talend but, in the next weeks, we will proceed with Kettle (Pentaho Data Integration).

In this tutorial we will learn how to import data on a Greenplum database from two CSV files and how to execute a lookup operation, using Talend. For the sake of simplicity, we assume that:

  • you have a very simple test database in Greenplum with two tables and a logical 1:N relationship between them: we chose states and users (a user can live in just one state)
  • you have two CSV files with the same data structure, one for the states and one for the users

The objective of this article is to demonstrate the concepts using a simple example, so that you can then apply the same techniques to more complex scenarios.

The first step is to download Talend Open Studio ( you can find it on the Talend Site ) and to install it.

Then we need to create a connection to Greenplum. From the Metadata section of Talend’s interface, add a Greenplum connection and insert the credentials. When done, a new object of type DbConnection will be created.

Right-click on it then select “Retrieve Schema“, so that Talend can gather all the needed information about the structure of the users and states tables.

Now it’s time to add the files that contain the data. Immediately below the “Db Connections” icon, you can find the group of delimited files. Click on “Create file delimited“.

In the first screenshot of the wizard, you will be asked to insert a name and a description for the resource being created. In step two, you need to locate the file and set the encoding.
In the third step, depicted in the picture below, you need to define row parsing methods.
You can select the encoding (1), field and row separators (2) and, if necessary, header rows or rows to be skipped (3) – either at the beginning or the end of the file.
It is also possible to ask Talend to auto-detect the structure of the file, by using the first row as header (4).

Click on “Next” then check that everything is fine and finish the process. You will have your file correctly associated to the project. Perform the above operation for both the users and states files. Once you have done, you can proceed with the creation of a job. Drag both files on the job, then select tFileInputDelimited as file type.

From the previously created Greenplum, drag the two tables (in our case states and users) in the editor, then select tGreenplumOutput as type.

That’s it for now. We have been able to create a connection, add two files and two tables in our project. We will continue with the loading phase in the second part of this article.
Share this

More Blogs