Pop quiz: Which is the fastest way to bulk load lots of data into an empty table in Postgres?
1. Load all data into the table, leaving all indexes and constraints enabled.
2. Drop all of the table's indexes and constraints, then load all of the data into the table, then re-enable the constraints and rebuild the indexes.
If you guessed number 2, you have likely read the Populating a Database page of Postgres's documentation. And you probably have plenty of hands-on experience that reinforces what Postgres's docs say.
I still remember a fairly routine day at a past job where I had two similar table bulk load tasks: one was fast, while the other was excruciatingly slow --- the one I had forgotten to drop the indexes from! I fixed my oversight and moved on with my work, but that gotcha moment stuck in my memory: leaving constraints on tables makes for very slow bulk loading.
In fact, I recalled this very memory when I learned that Debezium, which is used by EDB's own Data Migration Service, recommends something very interesting: drop all constraints on your tables and re-enable them after the bulk data load... except leave the primary key in place.
I immediately thought this would be terrible for performance.
But time moves on, technology improves, and having some actual performance numbers would inform me how (un-)founded my fears really were.
So I made a little setup to try to measure the difference. I already have Postgres installed on my laptop, and I already have a USB-attached multi-terabyte hard drive: it was enough to get started.
My experience has shown me that these performance differences are more obvious with large tables. So, I used HammerDB to create a TPC-C database with a bunch of warehouses, so that the order_line table would have just over 300 million rows in it. I then used psql's \copy command to copy the table out to a copy file that I could use as my load data for an empty order_line destination table.
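For reference, that export step is a one-liner in psql; the output path below is a made-up placeholder, not the path I actually used:

\copy order_line TO '/path/to/order_line.copy'

One nice property of \copy is that it runs the COPY through the client, so the file lands wherever psql is running rather than on the database server.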
For the curious, this is what HammerDB's TPC-C order_line table looks like:
tpcc> \d order_line
Table "public.order_line"
┌────────────────┬──────────────────────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────────────┼──────────────────────────┼───────────┼──────────┼─────────┤
│ ol_delivery_d │ timestamp with time zone │ │ │ │
│ ol_o_id │ integer │ │ not null │ │
│ ol_w_id │ integer │ │ not null │ │
│ ol_i_id │ integer │ │ not null │ │
│ ol_supply_w_id │ integer │ │ not null │ │
│ ol_d_id │ smallint │ │ not null │ │
│ ol_number │ smallint │ │ not null │ │
│ ol_quantity │ smallint │ │ not null │ │
│ ol_amount │ numeric(6,2) │ │ │ │
│ ol_dist_info │ character(24) │ │ │ │
└────────────────┴──────────────────────────┴───────────┴──────────┴─────────┘
Indexes:
"order_line_i1" PRIMARY KEY, btree (ol_w_id, ol_d_id, ol_o_id, ol_number)
I ran a few tests of loading 300 million rows into this table with the primary key present, and then I ran a few tests where I dropped the primary key and re-built it later.
In both tests, I used psql's \copy command to copy in the copy file I had created above. For the tests where I had to rebuild indexes, I cranked up maintenance_work_mem to take advantage of the RAM on my machine.
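Spelled out, the two procedures looked roughly like this; the file path and the maintenance_work_mem value are illustrative stand-ins rather than my exact settings:

-- The recommended way: drop the primary key, load the data, then rebuild it.
ALTER TABLE order_line DROP CONSTRAINT order_line_i1;
\copy order_line FROM '/path/to/order_line.copy'
SET maintenance_work_mem = '8GB';  -- illustrative value, sized to the machine's RAM
ALTER TABLE order_line
    ADD CONSTRAINT order_line_i1 PRIMARY KEY (ol_w_id, ol_d_id, ol_o_id, ol_number);

-- The "wrong" way: leave the primary key in place and just load.
\copy order_line FROM '/path/to/order_line.copy'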
Doing it the recommended way took roughly 10 minutes to load the data, and 20 minutes to build the primary key index. So, 30 minutes in total.
Doing it the "wrong" way (leaving the primary key in place) took 20 minutes to load the data, and, of course, 0 minutes to build the primary key index because that step didn't need to be done. So, 20 minutes in total.
Surprise! It was faster to load 300 million HammerDB order_line table rows with the primary key in place.
I was not expecting this result.
This is where a rabbit hole immediately opened up: there were so many ways I could think of to redo this test until I got the "correct" result, the one confirming that we always drop all constraints before bulk loading data.
The first thing I figured was that 300 million rows of HammerDB order_line takes less than 30 GB on disk, and my laptop has 32 GB of RAM, so maybe I was getting some sort of "it fits in RAM" boost. So I had HammerDB create me an order_line table with 900 million rows, which took just over 100 GB on disk: much more than my computer's 32 GB of RAM. This time, I would get the expected result.
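As an aside, you can check a table's on-disk footprint with a couple of built-in functions, if you want to read off sizes like these for yourself:

SELECT pg_size_pretty(pg_table_size('order_line'));           -- heap (and TOAST), no indexes
SELECT pg_size_pretty(pg_total_relation_size('order_line'));  -- heap plus all of its indexes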
Nope! It took roughly an hour to load 900 million rows into an order_line table with the primary key in place, and roughly an hour and a half with no primary key in place and having to build the primary key after the data load.
Should I go further down the rabbit hole?
- I had already tuned my GUCs to take advantage of my computer's 32 GB of RAM (the sketch after this list shows the kind of settings I mean).
- I had already stopped every other program on my machine: no browsers were open during the test, for instance.
- I had the source copy file on a different disk (a USB SSD) than the destination table (a USB HDD).
- My data weren't going over a network (just the internal bus on my computer), so that probably wasn't a bottleneck.
- Maybe I should put WAL on a different disk from the order_line table.
- Maybe I would get a different result if my table had wider rows.
- Maybe my destination disk should be an SSD, the same way my source disk was.
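"Tuned my GUCs" deserves a little unpacking: I have not reproduced my exact configuration here, but for a dedicated 32 GB machine the settings in question are roughly of this flavor (the values below are illustrative assumptions, not my benchmark config):

-- Illustrative values only; adjust to your own hardware.
ALTER SYSTEM SET shared_buffers = '8GB';        -- takes effect after a restart
ALTER SYSTEM SET maintenance_work_mem = '4GB';  -- helps big index builds
ALTER SYSTEM SET max_wal_size = '32GB';         -- fewer forced checkpoints during the load
ALTER SYSTEM SET checkpoint_timeout = '30min';
SELECT pg_reload_conf();                        -- a restart is still needed for shared_buffers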
Then I stepped back and realized something. There wasn't anything particularly unique about my setup. There are indeed Postgres 17 installations out there that have everything on one disk (WAL and heap files), that have access to 32 GB of RAM (or less!), and that have their GUCs tuned at least semi-competently. Most of them don't even have the luxury of having their source file on a USB-attached SSD; most of them will be streaming their input data over a network.
So what is the takeaway of my little experiment? I think it is safe to say this: while best practices exist for a reason, and prior experience counts for a lot, when performance really matters, do some test runs and take some measurements! Because general advice is just that: general; your specific situation may be different. And measuring things can turn up surprises, even pleasant surprises!
And if I have to load 900 million rows of data into my Postgres database using EDB's Data Migration Service, I don't think table primary keys are going to be a performance bottleneck.