Linux filesystems and PostgreSQL checkpoint benchmarks

February 18, 2011

Following up on last month’s Tuning Linux for low PostgreSQL Latency, there’s now been a giant pile of testing done covering two filesystems, three patches, and two sets of kernel tuning parameters.  The result so far is some interesting new data, plus one more committed improvement in this area that is in PostgreSQL 9.1 now (making three total; the other two are monitoring patches).  I’ll be speaking about recommended practice next month during one of my talks at PostgreSQL East, and I’ve submitted something in this area for May’s PGCon too.  Here I’ll also talk a bit about the dead ends, while those memories are still fresh.

The basic problem here is that the way PostgreSQL uses the operating system cache when writing allows large amounts of data to accumulate.  The result when database checkpoints finish can be long delays while waiting for that data to be written out.  It turns out the pgbench program that comes with PostgreSQL is really good at creating this problem, so that’s what I used for all the tests.  The main questions I set out to answer were:

  1. Does changing from the old ext3 filesystem really show a performance improvement on database tasks?  I wrote something about the Return of XFS on Linux last year that showed a nice improvement on simple benchmarks.  That doesn’t always translate into database improvements though.
  2. Do the recent Linux dirty_bytes and dirty_background_bytes tunables really improve worst-case latency?
  3. Which of the database changes suggested for improving behavior here actually work?

You can see all of the test results if you want to check out the raw data.  What was changed for each test set is documented, and if you drill down into an individual test you can see the database parameters used and some other basic OS information.  That web page is what comes out of my pgbench-tools testing program, if you’d like to try this sort of thing yourself.
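
If you want to reproduce a single run by hand instead of through pgbench-tools, the workload is just the standard write-heavy (TPC-B-like) pgbench test with per-transaction latency logging.  The client count and run length below are illustrative placeholders rather than the exact settings behind these results:

    # Build a pgbench database at the smaller test size (scale=500, roughly 8GB)
    createdb pgbench
    pgbench -i -s 500 pgbench

    # Run the default write-heavy test for 10 minutes with per-transaction
    # latency logging (-l), so worst-case latency can be examined afterward
    pgbench -c 16 -j 4 -T 600 -l pgbench

pgbench-tools automates exactly this sort of loop, sweeping over client counts and database scales and graphing the resulting TPS and latency figures.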

The results weren’t very surprising, but they were interesting.  All of the tests here were done with two database sizes.  At the smaller database size (scale=500, about an 8GB database that easily fit in the server’s 16GB of RAM), ext3 managed 690 transactions/second, while at twice that size (scale=1000, about a 16GB database) it was much more seek-bound and only managed 349 TPS.  XFS increased those two numbers to 1757 TPS and 417 TPS, roughly a 2.5x gain at the smaller scale and a 19% gain at the larger one.  Even better, the worst-case latency for a single transaction dropped from the 34 to 56 second range (!) to the 2 to 5 second one.  While even 5 seconds isn’t great, this is a synthetic workload designed to make this problem really bad.  The ext3 numbers are so terrible that you’re still quite likely to run into a nasty problem here, even though I was actually seeing better behavior on that filesystem than I’ve seen with earlier kernels (this was done with 2.6.32).

Round 1:  XFS wins in a landslide.  I cannot recommend ext3 as a viable filesystem on Linux systems with lots of memory if you plan to write heavily; it just doesn’t work in that context.  This server only had 16GB of RAM, so you can imagine how bad this problem is on a serious production server here in 2011.
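
If you want to try the same comparison, moving the database volume to XFS doesn’t require anything exotic.  A minimal sketch, with the device name and mount point as placeholders (and check mount options such as write barriers against what your hardware can handle safely):

    # Create an XFS filesystem on the database volume (placeholder device name)
    mkfs.xfs /dev/sdb1

    # Mount it where the PostgreSQL data directory lives; noatime avoids
    # extra metadata writes on every read
    mount -t xfs -o noatime /dev/sdb1 /var/lib/pgsql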

Next up, the dirty_bytes and dirty_background_bytes tunables.  These two improved latency quite a bit on ext3, at the expense of some slowdowns.  The worst of those, slower maintenance time when running VACUUM, doesn’t show up in the test results themselves; I already discussed that in my earlier blog entry.  On XFS, tuning these parameters down is a performance disaster.  At the smaller database scale, TPS performance drops 46%, and on top of that latency actually gets worse.
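
For reference, these knobs live under /proc/sys/vm, and the byte-based forms override the older percentage-based vm.dirty_ratio and vm.dirty_background_ratio.  The values below are only an example of the kind of reduction being discussed here, not a recommendation:

    # Example only:  start background writeback at ~256MB of dirty data,
    # and block writers once ~512MB has accumulated
    sysctl -w vm.dirty_background_bytes=268435456
    sysctl -w vm.dirty_bytes=536870912

    # The same settings go into /etc/sysctl.conf to survive a reboot:
    #   vm.dirty_background_bytes = 268435456
    #   vm.dirty_bytes = 536870912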

Round 2:  Don’t expect any miracles from dirty_bytes or dirty_background_bytes.  They do seem to have some positive effect in some circumstances, but the potential downside is big too.  Make sure to test carefully, and include VACUUM in your testing, before adjusting these two downward.

Next, I evaluated three patch ideas for PostgreSQL as part of this last CommitFest:

  • Spread checkpoint sync to disk (fsync) calls out over time.  We’d seen some success with that on a busy client server when it was combined with improved handling of how other sync operations were cached by the database; see the configuration sketch after this list for context.
  • Compact fsync requests.  This idea spun off from the first one and turned into a patch written by Robert Haas.  The problem is that clients trying to sync data to disk can end up competing with the checkpoint writes.  What the patch does is allow clients to clean up the queue of fsync requests if they ever find it full.
  • Sort checkpoint writes.  The concept is that if you write things out in the order the database believes they are stored on disk, the OS might write more efficiently.  This patch showed up a few years ago with some benchmark results suggesting it worked, but at the time no one was able to replicate the improvements.  The idea fit into the rest of the work well enough that I evaluated it again.
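
For context on the first item:  PostgreSQL already spreads the write portion of a checkpoint out over time, controlled by checkpoint_completion_target; the patch extends similar spreading to the sync phase at the end.  A minimal postgresql.conf sketch of the existing knobs, with illustrative values rather than the ones used in these tests:

    # Illustrative values only -- not the settings from these benchmark runs
    checkpoint_segments = 64              # allow more WAL between checkpoints
    checkpoint_completion_target = 0.9    # spread checkpoint writes over 90% of the interval
    log_checkpoints = on                  # log write and sync timing for each checkpoint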

Round 3:  After weeks of trying all this out, the only approach out of this set that showed an improvement at almost all workload sizes was the fsync compaction one.  The original spread checkpoint sync code helped in this area some, but the specific implementation that is now committed for 9.1 worked even better.  It was a nearly across-the-board 10% gain on most of the write-heavy tests I ran.  That’s a great improvement for PostgreSQL 9.1, and it should completely eliminate a problem that we’ve seen cause a much greater slowdown on production systems here.

The rest of the ideas here didn’t get such a positive evaluation after heavy benchmarking, so for now they go back on the shelf.  I’ll continue to gather data here (some ext4 tests are the next logical thing to try) and then return to development again.  Getting a 10% gain on some difficult workloads is certainly nice, but there are still far too many worst-case behaviors here to consider checkpoint sync issues a closed subject.
