Linux's fsync() woes are getting some attention
Author: Robert Haas

In two weeks, I'm headed to LSF/MM and the Linux Collaboration Summit, by invitation of some Linux kernel hackers, to discuss how the Linux kernel can better interoperate with PostgreSQL.  This is good news for PostgreSQL, and hopefully for Linux as well.  A post from Mel Gorman indicates that this topic is attracting a lot of interest, and that MariaDB and MySQL developers have now been invited to participate as well.  His summary of the discussion so far quotes some blunt words from one of my posts:

 IMHO, the problem is simpler than that: no single process should
 be allowed to completely screw over every other process on the
 system.  When the checkpointer process starts calling fsync(), the
 system begins writing out the data that needs to be fsync()'d so
 aggressively that service times for I/O requests from other process
 go through the roof.  It's difficult for me to imagine that any
 application on any I/O scheduler is ever happy with that behavior.
 We shouldn't need to sprinkle of fsync() calls with special magic
 juju sauce that says "hey, when you do this, could you try to avoid
 causing the rest of the system to COMPLETELY GRIND TO A HALT?".
 That should be the *default* behavior, if not the *only* behavior. 

Long-time PostgreSQL users will probably be familiar with the pain in this area.  If the kernel doesn't write back pages aggressively enough between and during checkpoints, then at the end of the checkpoint, when the fsync() requests start arriving in quick succession, system throughput goes down the tubes as every available write cache fills up and overflows.  I/O service times become very long, and overall performance tanks, sometimes for minutes.  A couple of things have been tried over the years to ameliorate this problem.  Beginning in PostgreSQL 8.3 (2008), checkpoint writes have been spread out over a long period of time to give the kernel more time to write them back, but whether it actually does so is up to the kernel.  More recently, attempts have been made to spread out the fsync() calls as well as the writes, but I'm not aware that any of these attempts have had fully satisfying results, and no changes along these lines have been judged sufficiently promising to justify committing them.  In one sense, what PostgreSQL really wants to know is whether starting the next fsync() now is going to cause the I/O subsystem to become overloaded, and there's no easy way to get that information; in fact, it's not clear that even the kernel has access to that information.  (If it did, we'd also hope that it would take care of throttling the fsyncs a little better even in the absence of specific guidance from PostgreSQL.)
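To make the idea of spreading out the fsync() calls concrete, here is a minimal sketch in Python.  It is not PostgreSQL's checkpointer code and not any of the patches mentioned above; the function names, the file names, and the fixed pause are all made up for illustration.  The fixed pause is exactly the weak spot described above: the application is guessing, because it has no way to ask whether the next fsync() will overload the I/O subsystem.

```python
import os
import tempfile
import time


def spread_fsyncs(paths, pause_seconds):
    """fsync each file in turn, sleeping between calls.

    Returns the number of files synced.  The pause is a blind guess
    (an assumption of this sketch): nothing here can observe whether
    the I/O subsystem is actually keeping up.
    """
    synced = 0
    for i, path in enumerate(paths):
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)  # blocks until the kernel reports the data durable
        finally:
            os.close(fd)
        synced += 1
        if i < len(paths) - 1:
            time.sleep(pause_seconds)  # give other I/O a chance to get through
    return synced


def demo():
    """Write three small hypothetical 'segment' files, then sync them slowly."""
    with tempfile.TemporaryDirectory() as directory:
        paths = []
        for n in range(3):
            path = os.path.join(directory, f"segment{n}")
            with open(path, "wb") as f:
                f.write(b"x" * 8192)
            paths.append(path)
        return spread_fsyncs(paths, pause_seconds=0.01)
```

Calling demo() syncs three scratch files with a short pause between them.  A real implementation would want to size the pause from some feedback signal, and the lack of such a signal is the problem.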

The other major innovation that I think has been broadly useful to PostgreSQL users, dirty_background_bytes, dates to 2009.  This was an improvement over the older dirty_background_ratio because the latter couldn't be set low enough to keep the dirty portion of the kernel's write cache as small as PostgreSQL needed it to be.  But it's unclear to what extent any further progress has been made since then.  Mel Gorman points out in his post that the behavior in this area was changed significantly in Linux 3.2, but many PostgreSQL users are still running older kernels that don't include these changes.  RHEL 6 ships with the 2.6.32 kernel, which does not incorporate these changes, and RHEL 7, which is apparently slated to ship with 3.10, is still in beta.
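A bit of arithmetic shows why the ratio couldn't go low enough.  The sketch below assumes that dirty_background_ratio is a whole-number percentage, so the smallest nonzero setting is 1.  The 64 GiB machine and the 64 MiB target are illustrative numbers of mine, not measurements, and the kernel applies the ratio to dirtyable memory rather than to total RAM, so treat the result as a rough estimate.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def ratio_threshold_bytes(dirtyable_bytes, ratio_percent):
    """Approximate background-writeback threshold for a percentage setting.

    Assumption: the threshold is simply ratio_percent of dirtyable memory,
    rounded down to a whole number of bytes.
    """
    return dirtyable_bytes * ratio_percent // 100


# Smallest nonzero ratio (1%) on a hypothetical 64 GiB machine.
floor_with_ratio = ratio_threshold_bytes(64 * GIB, 1)

# With dirty_background_bytes the threshold can be stated directly,
# for example an illustrative 64 MiB.
target_with_bytes = 64 * MIB

print(f"1% of 64 GiB : {floor_with_ratio / MIB:.0f} MiB")
print(f"bytes setting: {target_with_bytes / MIB:.0f} MiB")
```

Under these assumptions the ratio knob bottoms out at roughly 655 MiB of dirty data before background writeback starts, about ten times the 64 MiB that the byte-based setting can ask for directly.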

It seems clear based on recent discussions that the Linux developer community is willing to consider changes that would make things better for PostgreSQL, but their ability to do so may be hindered by a lack of good information.  It seems unlikely that all of the problems in this area have been fixed in newer releases, but more and better information is needed on the extent to which they have or have not.  Perhaps someone's already done this research and I'm simply not aware of it; pointers are appreciated.