ETL with Kettle & Greenplum: Importing CSV Data

Giulio Calacoci

October 24, 2011

In the first part of this article we have created a job, a database connection and defined the flow in Kettle. In the second part we’ll see how Kettle manages the data import from the CSV files.

Click
on the “New” button on the top left of the Kettle window, and
select “new transformation”.

From
the Design tab, drag a “CSV input” component. You can find it
under the “input” folder. Once you drop it on the transformation
design panel, double click on it.

A
configuration panel will pop up. There you can insert the path of the
file (1), the delimiter (2) and the file encoding (3). Clicking on the
“get fields” (4) button, Kettle will try and guess the fields
properties, reading the first line of the file. Clicking on the
preview button you can see how the data from the CSV will be managed.

Now
drag from the output folder the “output table” component, and
drop it in the design panel. Connect the CSV file to the table
component by holding the “shift” key and “drawing” a line.

Double click on the table element.

Another
configuration panel will pop up. There you can select the connection
(1), the target schema (2) and the target table (3). In the lower
part of the panel you will see a tab called “database fields” (4),
where you can insert the field mapping (5) between the CSV and the table by simply
clicking on the “enter field mapping button” (6) .

The
“States” transformation is now ready and it is time to import “Users“.
As said in the first part of this article, our “users” table references
the “states” table and we need to perform a lookup on that table in order to
preserve relational integrity.

So,
let’s create a new transformation, and drag a “CSV file input” from the
input folder. Configure the file like you did with the States CSV.

From
the “Lookup” folder, drag and drop a “Database Lookup element”.
Connect the CSV element to this element, then double click on it.

In
the panel that pops up the element can be configured. Select the data
source and the table to look into, then using the first of the two tables (1), it’s possible to map the
keys and conditions to perform the database lookup. Using the second table (2) we can choose which fields from the lookup table we want add to
the output stream. “New name” allows you to use a different
name if the database column name is inappropriate. “Default”
is the value returned (instead of null) if the lookup fails.

Using
the “Get Fields” Button and the “Get Lookup Fields“, Kettle will populate the tables with all the fields from the input stream and
from the lookup table.

Finally
choose a “Table Output” element from the output folder, drag it
to the design zone, then connect it to the lookup element. By double clicking on it, as usual, it is possible to configure the element.
Select the Connection, the schema and the destination table, then in
the “Database Fields” tab, click on the “Enter Field Mapping” button. This will open the mapping tool window, from which it is possible to
choose the associations between the fields of the data stream and the database table columns.

Now this transformation has been completed as well and it is time to import the two
“sub-jobs” into the “master” job.

Save
the two transformations as “.ktr” file, then go to the Job tab,
double click on the first transformation element, and in the
“transformation filename” insert the path of your first “.ktr” file (the States transformation). There is also a “browse” button at the
end of the input line. Once associated the file to the job, you can
click on “ok”. Repeat the operation with the second transformation element,
by pointing to the second “.ktr” file.

Finished!
Just configure the mail object – as usual by double clicking on it – to
receive a mail in case of an error during the data import phase. All you have to do is insert information about your SMTP server and your email address.

Resource Feature Callout 1

ETL with Kettle and Greenplum - Part Two: importing data

Giulio Calacoci

Integrating EDB's AI Accelerator with Nvidia NIM

Representing graphs in PostgreSQL with SQL/PGQ

Postgres in the time of monster hardware

Resource Feature Callout 1

ETL with Kettle and Greenplum - Part Two: importing data

Giulio Calacoci

More Blogs

More Blogs

Integrating EDB's AI Accelerator with Nvidia NIM

Representing graphs in PostgreSQL with SQL/PGQ

Postgres in the time of monster hardware