Jul 2, 2020
A PostgreSQL Database Failure and a Failover
A customer reached out to the Support team on a random Tuesday, some months ago. They had experienced a failure of their primary postgresql server, and had manually promoted their hot standby server. For the sake of levity, we’ll call the original primary Larry and the original standby Moe.
When they restarted the server on which Larry (the original primary) ran, the hardware itself came back up--but the database team didn’t know why it’d gone down in the first place. There was no planned maintenance, no power outage. While they’d seen some SMART errors on the media, it looked as though they were ultimately in good health.
So, the hardware was back up, but the postgresql service wouldn’t start. They checked the filesystem, and ran fsck -A. The results were not promising:
fsck exited with status code 4
After some storage team heroics, for which we weren’t involved, the customer’s team found that the root cause was silent disk corruption on a solid-state cache drive.
The good news was that the database was the only application on the machine itself. So, given that they’d been running streaming replication in a synchronous state between Larry (the original primary) and Moe (the original standby), they elected to replace the drive, wipe the machine clean and add Larry back as a hot standby.
The next day, Wednesday, they began to experience issues on the application--queries, for records to which they had already received commits on Tuesday, were failing on Moe. So they did what any customer may: they reached out to EDB Support.
We spoke with the customer, and while waiting for logfiles, we asked the usual questions:
- What was the sequence of events?
- What are the symptoms?
- What steps have you taken?
At this point, we knew much of what you now know. However, we had one sneaking suspicion that was looming. When they sent their postgresql.conf file, we saw the problem:
While we could not confirm what the scale was of the data that was now gone, we could certainly confirm why it was gone. The customer was only leveraging synchronous replication as far as the local Write-Ahead Logging (WAL) on Larry.
Now, this ensures that two good things to happen:
- The query has successfully reached the backend process
- The query has successfully been written to disk on the local WAL
However, it doesn’t ensure that three other good things happen:
- The transaction has successfully been written to the OS of the standby server(s)
- The transaction has successfully been written to disk on the standby server(s)
- The transaction has successfully been written to the database on the standby server(s)
So why does that matter in this case? Well, the database itself didn’t fail. This is good. What wasn’t good? The “other good things” above refer to a standby server. At the time, that standby server was Moe. Moe’s now the primary server. When the failure happened, the transactions which were on their way from Larry’s walsender process to Moe’s walreceiver process were lost. Updates and inserts alike.
Currency or Consistency
The business had a tough call to make here; either continue with a window of an unknown number of missing transactions, or recover to a last-known state of consistency. Because the data was locational, it was a crucial decision to make as the delta of changes made in the time since the original forced promotion and this juncture was not small. Hundreds of thousands of records had been changed in the time since, amounting to many gigabytes of change.
Since the load was consistent, we suggested using the current lag between the standby server and the primary server might give them an idea of the magnitude of the records affected. This, as well as a lot of other meaningful statistics, can be estimated the long way around by using the contents of both pg_stat_replication and pg_stat_wal_receiver. However, there are functions which exist to make our lives easier, and one of these is called pg_wal_lsn_diff, which looks like: pg_wal_lsn_diff(lsn pg_lsn, lsn pg_lsn) / pg_xlog_location_diff (location pg_lsn, location pg_lsn). This function is found under the System Administration Functions section of the documentation, found here. It returns the difference between WAL locations on the primary and the standby, measured in bytes.
After working with the customer to execute this several times, we measured that the typical lag over a 30-minute window was approximately 64MB. 64 megabytes of exposure was the rough estimate. When compared to the gigabytes of changes made in the time in which Moe had been the new primary, the new primary was kept as-is, with records to be identified through application logging over time.
While EDB Support had done everything it could do to help the immediate problem, the underlying problem was more fundamental. We now explained to the customer why this happened, and how to evaluate whether or not they needed to be exposed to a risk like this in the future.
Is it Synchronous or Synchronous?
PostgreSQL’s streaming replication gives you the freedom to pick how consistent your business requires the primary and standbys to be, by using both synchronous_commit and synchronous_standby_names. How?
Synchronous_commit, articulated in the documentation here, allows you to define how persistent you want your commits to be:
- synchronous_commit=off — commits are sent to the application as soon as the transaction is handled by the backend process, but before the transaction is flushed to the WAL. This means that even a single server can hypothetically suffer data loss. Replication itself is ignored.
- synchronous_commit=remote_write — Commits are sent by the primary to the application once operating system of the standbys defined in synchronous_standby_names has sent a write-confirmation of the transactions sent by the primary walsender. This takes us one step further
- synchronous_commit=on — Commits are sent to the application once the data is flushed to the WAL. This applies to both the primary in a single-server configuration as well as the standbys in streaming replication. This is actually one step further than remote_write, since it provides commits only after it’s made it to the standby WAL’s.
- synchronous_commit=local — Commits are sent to the application once the data is flushed to the WAL on the primary node. Replication itself is ignored. Think of it as a blend between off and on, where it ensures local WAL’s while ignoring replica consistency.
- synchronous_commit=remote_apply — Commits are sent by the primary to the application only after the standbys defined in synchronous_standby_names have confirmed that the transactions have been applied beyond the WAL, to the database itself.
Here’s where it gets a little trickier, and you can find the full documentation here. This follows one of three syntaxes:
- [FIRST] num_sync (standby_name [n1,n2,...]) — Where num_sync is the number of synchronous standby nodes that are required for a quorum reply, and standby_name is the name of a standby server, this format creates a priority-based quorum commit. Those identified in the num_sync value are required to be synchronous, and those after that subset may be synchronous but are not required for a successful commit. Example: [FIRST] 2 (larry,curly,moe,shemp) would require that confirmations are provided from curly and moe in order for larry, the primary, to send a commit. ‘Shemp’ may or may not be synchronous at the time that this commit would be sent.
- ANY num_sync (standby_name [n1,n2,...]) — ANY will allow the confirmation to be provided by the number of standbys identified in num_sync irrespective of their ordering. Example: ANY 2 (larry,curly,moe,shemp) would accept confirmations from any combination of two standbys of these three servers, and once num_sync has been reached, the primary would send the commit.
- Standby_name [n1,n2,...] — This format is just a holdover from PostgreSQL 9.6 and earlier, and is honored in the same way as the FIRST syntax.
So, if you distill all of this, you come to the following conclusions:
- I should totally have synchronous_commit = write_apply.
- I should definitely be using synchronous_standbys too based on my solution’s priorities.
Safety Isn’t Free
PostgreSQL optimization is a balance of costs and gains. Each change you make can affect you in myriad ways, but two of the most common measures are TPS and lag. When you require more conditions to be met in order to provide the commit from your primary back to the application, it costs time and resources. As our customer learned, they could gain resilience, but depending on how much stress they’ve put on the machines, their network, the application, or the PostgreSQL internals, it might mean a lower level of performance.
Several years ago, an article was produced using a pgbench test of what implications these configurations would have, and it was significant. With synchronous_commit set to “off”, they saw over 2x the performance that was observed when using synchronous_commit=remote_apply. The most common configuration, synchronous_commit=on, saw a figure that was more in the middle of these three.
The impacts of each of these will vary (the famous “Your Mileage May Vary”) depending on what sort of load you’re producing, the rate at which this load generates WAL files, the speed of the network, and the machine resources for all servers in the cluster.
But How Do You Decide?
Well, for this customer, it was simple. How did they avoid the situation they ran into here, with the least impact to performance? The answer was to use synchronous_commit=on. If they added any nodes to the 2-node cluster, they would need to be very thoughtful about how they used standby_names as well.
For users in general, you have to consider the business requirements and technical limitations of the solution, and find a balance that best suits the reality of your implementation. As a general rule of thumb, every solution wants to…
- Be fast
- Be available
- Be persistent
- Be durable
- Be efficient
The question becomes, how fast? How available? How persistent? How durable? These attributes ultimately define your efficiency in a much bigger picture, whether it be through the lens of cost, time, effort or risk. At the end of the day, you don’t want to have to choose between Larry’s failed memory or Moe’s availability. You need the whole cast to make it work, and how you do that is a fine measure of the best outcome for your system.
Join Postgres Pulse Live!
Our goal is to continue to leverage our experiences with our customers to bring meaningful insights for all PostgreSQL users, both in the problems themselves as well as in the approach we take to the solution. Looking for more help just like this? Here you can find all things Postgres Pulse, including all of our blog posts and our YouTube series.
Join us on Monday, July 6th, for our next Pulse Live Session! Remember, we’re open to any questions you may have around PostgreSQL at all. You can ask your questions via email at email@example.com, #PostgresPulse on Twitter, or during the event right here.