Swoop de Dupe

Elein

May 29, 2019

The problem: duplicate rows

Once upon a time, there was a database person who knows about referential integrity and its importance for creating database tables. The interesting part is that referential integrity, including primary keys, enable keeping with the standard of not having duplicate rows.

However, they just wanted a private little table on their private little database to hold their contacts. No primary keys were needed because none of the names were going to be the same. The only thing needed was to avoid recreation of the table. No formatting was needed with the phone or email data. It was just a simple lookup table.

create table if not exists contacts (
fname text,
lname text,
phone numeric(11),
email text
);

In the initialization script, the \copy command was added in order to copy the tsv file with the contacts into the new table. It failed a number of times due to typos, bad filenames, wrong version, new keyboard, drinking while typing, being interrupted, bad data format, the dog ate my homework, issues with copy vs. \copy… And while wrangling these errors, some of the copies failed, and some succeeded. There are many ways to get unwanted duplicate rows in your table.

? ? ? ? ? ? ?

Knowing they had 26 friends, one for each line in the file, they checked out the data.

=# select count(*) from contacts;
count
-------
104
(1 row)

OMG. What a mess. The easy solution to this problem is to drop the table and successfully copy in the data once and only once. But sometimes duplicates happen inadvertently in live and important tables where drop/recreate/reload is not an option. It is due to not creating a proper primary key. (See the latest rant on primary keys and serials in your favorite mailing list.)

The Solution

Let us see how to de dupe this little table and you can see how to apply it to more crucial tables.

1. Add a unique identifying column to your table. Make it a SERIAL PRIMARY KEY.
2. Update the identifying column to its default. This will add a unique value in the column for each record.

alter table contacts add column uid serial primary key;
update contacts set uid = default;
select uid, fname, lname, phone, email from contacts limit 5;

uid	fname	lname	phone	email
79	abba	anderson	15105551234	abba@gmail.com
80	bobby	bronson	15105551234	bobbyB@hotmail.com
81	carla	corbani	15105551234	corbani@ibm.com
82	daryll	dunkirk	15105551234	ddunkirk@twitter.com
83	emilia	every	15105551234	emiliae@wsu.edu

(5 rows)

3. Here are some things to notice.

In this little query, data in one set is distinct by fname, lname. We can partition by those columns to find duplicates. Failing a couple of columns that are distinct in the set, we could use the entire row to partition.

The row_number() function returns the number of the current row within its partition, counting from 1. It requires an over (partition by col[, col]) statement.

4. Number the sets of duplicates. Using the row_number and over (partition) construct, we can identify all rows uniquely and number the rows in a set of duplicates.

select uid, fname, lname, 
       row_number() over 
       (partition by fname, lname order by fname, lname) 
from contacts limit 9;

uid	fname	lname	row_number
105	abba	anderson	1
79	abba	anderson	2
131	abba	anderson	3
80	bobby	bronson	1
132	bobby	bronson	2
106	bobby	bronson	3
133	carla	corbani	1
81	carla	corbani	2
107	carla	corbani	3

(9 rows)

5. Delete duplicate rows
With the information in the above query, we can delete the rows by uid where the row numbers in a set are greater than one. This leaves us with only row numbers 1 for each set.

 delete from contacts
 where uid in (select uid
 from (select uid, fname, lname,
              row_number() over 
       (partition BY fname, lname order by fname, lname) as rnum
       from contacts ) t
 where t.rnum > 1;

6. Check your data…just in case

=# select count(*) from contacts;
 count
 -------
 26
 (1 row)

7. Clean up

Always clean up any extra columns, tables, temporary functions, etc when completing your tasks. This cleanup will only require the dropping of the newly added column.

alter table contacts drop column uid;

Script template: de Dupe

This is a template of the entire script. Just substitute the bracketed items with your table and column names.

alter table <table> add column uid serial primary key;
 update <table> set uid = default;
 delete from <table>
 where uid in (select uid
 from (select uid, <cola>, <colb>,
              row_number() over (partition by <cola>, <colb>
              order by <cola>, <colb>) as rnum
       from <table> ) t
       where t.rnum > 1;
 alter table <table> drop column uid;

Share and enjoy good referential integrity.

elein.mustain@2ndquadrant.com — GeneralBits

DATA & AI SOVEREIGNTY

DATA & AI SOVEREIGNTY

Swoop de Dupe

Elein

The problem: duplicate rows

The Solution

Script template: de Dupe

The Cost of Downtime and How to Reduce Its Impact on Your Organization

Software Bill of Materials (SBOM): The Traceability You Need

PostgreSQL Contributor Story: Manni Wood

DATA & AI SOVEREIGNTY

DATA & AI SOVEREIGNTY

Swoop de Dupe

Elein

The problem: duplicate rows

The Solution

Script template: de Dupe

More Blogs

More Blogs

The Cost of Downtime and How to Reduce Its Impact on Your Organization

Software Bill of Materials (SBOM): The Traceability You Need

PostgreSQL Contributor Story: Manni Wood