How to handle duplicates and unique index conflicts when importing data into EDB Postgres Advanced Server with EDB Loader


Jeevan Ladhe, Technical Architect

Like most database vendors, EnterpriseDB provides a way to bulk copy data from files: a handy tool called EDB*Loader for EDB Postgres Advanced Server. It is a high-performance bulk data loader that provides an interface compatible with Oracle SQL*Loader.

The fact that you are reading this blog suggests you are already familiar with EDB*Loader in some way. If you have been using it already, that's great; if not, don't worry, you will pick it up as we walk through the examples. To learn more, you can also refer to the official EDB*Loader documentation.

At times you might have wondered: I want to load a CSV file into a table that has unique indexes, but I know the input file might contain duplicate records. I just want to insert the records that can go in without causing a duplicate key violation. Until recently, that was simply not possible. Before EDB Postgres Advanced Server version 13, EDB*Loader would abort the load when it hit duplicate records violating unique indexes, as shown in the example below:

Create a student table:

CREATE TABLE student(rollno int unique, name varchar(20));

 

Now, let's say we have the following CSV input file:

$ cat students.csv 

1,RAHUL
2,JOHN
3,PEPPA
1,SANJAY
4,KELLY
2,GEORGE
5,IMRAN

 

And our control file looks like this:

$ cat students.ctl
LOAD DATA
INFILE 'students.csv'
BADFILE 'students.bad'
INSERT INTO TABLE student
FIELDS TERMINATED BY ','
( rollno, name )

 

Now, attempt to load students.csv using the edbldr command:

$ edbldr -d edb userid=jeevan/edb CONTROL=students.ctl
EDB*Loader: Copyright (c) 2007-2020, EnterpriseDB Corporation.

edbldr error: ERROR:  duplicate key value violates unique constraint "student_rollno_key"
DETAIL:  Key (rollno)=(1) already exists.

 

Oops, the load simply aborted, and no records made it into the table:

edb=# SELECT * FROM student;

 rollno | name 
--------+------
(0 rows)
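Before version 13, the usual workaround was to de-duplicate the input file yourself before invoking edbldr. Here is a minimal sketch in Python, assuming first-occurrence-wins semantics; the file names follow the example above, but the script is illustrative and not part of EDB*Loader:

```python
import csv

def dedupe_file(src, dst, key_index=0):
    """Copy src to dst, keeping only the first row seen for each key value."""
    seen = set()
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            if row[key_index] not in seen:
                seen.add(row[key_index])
                writer.writerow(row)

# Recreate the sample input, then produce a clean copy to load.
with open("students.csv", "w") as f:
    f.write("1,RAHUL\n2,JOHN\n3,PEPPA\n1,SANJAY\n4,KELLY\n2,GEORGE\n5,IMRAN\n")

dedupe_file("students.csv", "students_clean.csv")
```

Pointing the INFILE clause of the control file at students_clean.csv then loads without tripping the unique index.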

 

But EDB Postgres Advanced Server version 13 comes with an improved EDB*Loader that addresses this specific requirement. From version 13 onwards, the edbldr command accepts a new parameter, HANDLE_CONFLICTS; when it is set to TRUE, the load continues to the end (barring any other errors) even if there are unique index violations. Records rejected due to unique index violations are logged to the '.bad' file, so you can analyze which records were rejected if need be.

Let us try loading the same students.csv as above, but this time run the edbldr command with HANDLE_CONFLICTS set to TRUE:

$ edbldr -d edb userid=jeevan/edb CONTROL=students.ctl HANDLE_CONFLICTS=TRUE
EDB*Loader: Copyright (c) 2007-2020, EnterpriseDB Corporation.

Successfully processed (7) records

 

Let us check what got inserted into the table:

edb=# SELECT * FROM student;                                                                        
 rollno | name  
--------+-------
      1 | RAHUL
      2 | JOHN
      3 | PEPPA
      4 | KELLY
      5 | IMRAN
(5 rows)

 

And the bad file tells us which records conflicted and were rejected:

$ cat students.bad
1,SANJAY
2,GEORGE
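Judging from the output above, the loader keeps the first row it sees for each key and rejects later conflicting ones. That behavior can be mimicked with a short Python simulation (my own sketch for illustration, not EDB code):

```python
def partition_rows(rows, key_index=0):
    """Split rows into (loaded, rejected): a later row whose key was
    already seen goes to rejected, mirroring a unique index conflict."""
    seen, loaded, rejected = set(), [], []
    for row in rows:
        if row[key_index] in seen:
            rejected.append(row)
        else:
            seen.add(row[key_index])
            loaded.append(row)
    return loaded, rejected

rows = [("1", "RAHUL"), ("2", "JOHN"), ("3", "PEPPA"), ("1", "SANJAY"),
        ("4", "KELLY"), ("2", "GEORGE"), ("5", "IMRAN")]
loaded, rejected = partition_rows(rows)
print(rejected)  # the rows that ended up in students.bad
```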

 

As if that were not enough, the feature works smoothly even if you have multiple unique indexes on your table.

Let's say we have a table "sample" with two columns, "col1" and "col2", both defined as unique. EDB*Loader will still continue loading to the end, ignoring conflicts on either "col1" or "col2". Here is an example to walk through:

edb=# CREATE TABLE sample(col1 int unique, col2 int unique);                                                                                                                                            
CREATE TABLE

$ cat sample.dat
111
222
333
112
444
533
666



$ cat sample.ctl 
LOAD DATA
INFILE 'sample.dat'
BADFILE 'sample.bad'
INSERT INTO TABLE sample
(
col1 POSITION(1:1),
col2 POSITION(2:3)
)

$ edbldr -d edb userid=jeevan/edb CONTROL=sample.ctl HANDLE_CONFLICTS=TRUE

EDB*Loader: Copyright (c) 2007-2020, EnterpriseDB Corporation.
Successfully processed (7) records

$ edb-psql
edb-psql (13.0.1, server 13.0.1)
Type "help" for help.

edb=# SELECT * FROM sample;

 col1 | col2 
------+------
    1 |   11
    2 |   22
    3 |   33
    4 |   44
    6 |   66
(5 rows)

$ cat sample.bad 
112
533
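To see exactly why '112' and '533' were rejected, here is a small Python sketch (an illustration of mine, not EDB code) that parses the fixed-width fields the way POSITION(1:1) and POSITION(2:3) do, and tracks both unique columns:

```python
def parse_line(line):
    """POSITION(1:1) -> col1, POSITION(2:3) -> col2 (1-based, inclusive)."""
    return int(line[0:1]), int(line[1:3])

def simulate_load(lines):
    """Mimic HANDLE_CONFLICTS=TRUE with unique indexes on both columns."""
    seen1, seen2, loaded, bad = set(), set(), [], []
    for line in lines:
        col1, col2 = parse_line(line)
        if col1 in seen1 or col2 in seen2:
            bad.append(line)  # would land in sample.bad
        else:
            seen1.add(col1)
            seen2.add(col2)
            loaded.append((col1, col2))
    return loaded, bad

loaded, bad = simulate_load(["111", "222", "333", "112", "444", "533", "666"])
```

'112' conflicts on col1 (the value 1 was already loaded from '111'), while '533' conflicts on col2 (33 was already loaded from '333').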

 

It is worth noting that this feature works only with the conventional load method, not with the direct load path. If both DIRECT and HANDLE_CONFLICTS are set to TRUE, edbldr throws an error:

$ edbldr -d edb userid=jeevan/edb CONTROL=students.ctl HANDLE_CONFLICTS=TRUE DIRECT=TRUE
EDB*Loader: Copyright (c) 2007-2020, EnterpriseDB Corporation.

edbldr error: ERROR:  HANDLE_CONFLICTS option is not supported with direct load

 

So, this new EDB*Loader feature gets you around the problem: duplicate records are ignored and the load continues to the end. For backward compatibility, the default value of HANDLE_CONFLICTS is FALSE.

I hope this pleases users who have been annoyed by EDB*Loader aborting on duplicate key conflict errors.

 


Jeevan Ladhe is a Technical Architect at EnterpriseDB. He contributes to Postgres community feature development, including bug fixes and reviews. He currently works as a database developer for EDB Postgres Advanced Server, responsible for feature development and bug fixes. Jeevan also maintains and adds features to the EDB Backup and Recovery Tool (BART). Prior to joining EDB in 2015, Jeevan worked at Sybase (now SAP) for about seven years on database technologies such as high availability, extended database caching, database drivers, and bulk copy. Before joining SAP, he did a six-month internship at Microsoft IDC. Jeevan holds a Master's in Computer Science from the Department of Computer Science, Pune University.