Let's Workshop an Unplanned Postgres Outage

July 07, 2023

It’s not a controversial statement to say that no database maintenance is without risk. Postgres and its community provide several useful tools to minimize the impact of even major overhauls such as server migrations and software upgrades. Yet with such manifold choice comes complexity: planning, playbooks, rehearsals, and all the other things you’d expect in order to ensure a smooth transition in the face of adversity.

Relatively recently, in November of 2022, RevenueCat launched an initiative to upgrade from AWS Aurora Postgres 10 to 14, and things went slightly awry. Is there anything they could have done to avoid the aftermath, which resulted in a costly five-hour outage? Their postmortem is extremely informative, and they definitely learned from the experience, but was it the right lesson? As Postgres experts, we have a unique opportunity to examine their play-by-play and offer some potentially useful feedback.

What went right

We’re not here to treat RevenueCat as a convenient piñata, exploiting their past turmoil for succulent candies to satisfy some kind of smug superiority complex. They didn’t mention any kind of consultation with an external Postgres company, so it’s possible they were working entirely on their own. Given that the upgrade almost completed uneventfully, their efforts are fairly commendable.

So what did they do correctly?

Copious planning

This is what RevenueCat had to say about their efforts toward a smooth transition:

“Since this is a complex (44 steps) and dangerous operation, over two months were spent by the team preparing for the migration and performing tests and many dry-runs.”

This is honestly commendable, and we wish all companies would treat their databases so methodically. This is precisely what we tell our clients on all engagements if they have the luxury and available resources. Have a playbook with cut-and-paste instructions, well-tested scripts to avoid mistakes, and execute dry-runs until all kinks are worked out. Even then, no plan survives first contact with the enemy, and there’s just something about a production attempt that seems to carry extra untested aspects or incurs the wrath of Murphy’s Law.

Parallel cluster

The entire RevenueCat Postgres 10 to Postgres 14 upgrade procedure was predicated upon a parallel cluster running the new version. They created a second cluster with the exact same node makeup, but running Aurora Postgres 14. This was facilitated by a logical bootstrap and then subsequent logical streaming replication to ensure both clusters remained synchronized.

Such upgrades usually look like this:

Diagram: two three-node clusters linked with logical replication

This is exactly what we would have told them to do. Aside from a multi-master solution like EDB Postgres Distributed, there is no better way to avoid downtime than to have an entire substitute cluster on-hand. It contains the same data, is a fresh copy unencumbered by accumulated bloat or index fragmentation, is entirely online and writable immediately, and with full read scalability. This has “only” been possible since Postgres 10 natively, or 9.4 while using pglogical, and removes several past pain points.
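With native logical replication, standing up the parallel cluster boils down to a publication on the old primary and a subscription on the new one. The object names and connection string below are placeholders for illustration, not RevenueCat’s actual setup:

```sql
-- On the Postgres 10 (source) cluster: publish every table.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the Postgres 14 (target) cluster: subscribing performs the initial
-- data copy, then streams all subsequent changes from the source.
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-cluster.example.com dbname=appdb user=replicator'
    PUBLICATION upgrade_pub;
```

Once the initial copy finishes, the target cluster stays synchronized until the switchover, while remaining fully readable and writable.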

It’s also possible to use the pg_upgrade utility instead, but this requires full downtime for as long as the procedure runs. There are also certain complications related to handling replicas that we’d rather avoid.
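For comparison, an in-place pg_upgrade run looks something like the following. The paths are purely illustrative, and the --link option avoids copying data files at the cost of making the old cluster unusable afterward:

```shell
# Both clusters must be shut down before pg_upgrade runs.
pg_upgrade \
    --old-bindir  /usr/lib/postgresql/10/bin \
    --new-bindir  /usr/lib/postgresql/14/bin \
    --old-datadir /var/lib/postgresql/10/data \
    --new-datadir /var/lib/postgresql/14/data \
    --link
```

Everything between shutdown and the completion of this command, plus the subsequent statistics rebuild, is a hard outage window.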

Cache warming

This is another issue that is often missed during cluster migrations, and one RevenueCat deftly sidestepped. Unlike the source cluster, which is actively handling reads and writes, the upgraded target cluster is usually completely idle. Even if there are no other problems, switching to a database system with a cold filesystem cache and empty shared buffers would lead to disproportionate amounts of random reads being serviced directly from the underlying storage.

This is often acceptable if the underlying storage is NVRAM or SSD-tier hardware, but if it’s hard-disk based, that’s an entirely different issue. When we addressed the demands of OLTP environments, we showed an example where cold HDD storage led to hours of perceived outage simply due to slow query response times.

RevenueCat’s approach was to warm the Postgres cache by reading the last 24-hours of data on the upgraded cluster. This may not be perfect, but it should drastically reduce the amount of cold reads from the underlying storage for the most recent data when the switchover takes place.
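RevenueCat’s exact warming queries aren’t public, but the idea can be sketched with the pg_prewarm contrib extension, or with a plain query over recent rows. The events table and created_at column here are hypothetical stand-ins for the real schema:

```sql
-- pg_prewarm ships with Postgres and loads a relation into shared buffers.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('events');
SELECT pg_prewarm('events_pkey');

-- Alternatively, a sequential pass over the last day of data warms both
-- shared buffers and the filesystem cache for the hottest rows.
SELECT count(*)
  FROM events
 WHERE created_at > now() - INTERVAL '24 hours';
```

Warming indexes as well as heap pages matters here, since index lookups are exactly the random reads that hurt most on cold storage.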

What went wrong

Into every life, a little rain must fall. So what happened to RevenueCat given how meticulously they planned their upgrade? It mainly comes down to experience. They identified two fundamental root causes:

  1. A missing ANALYZE on the new cluster, and from there, the query performance dominoes began their relentless march to oblivion.
  2. Sequences aren’t copied as part of logical replication, meaning all sequences restarted at 1 in the new cluster.

The first is usually transitory, though painful if tables are large enough, but the second is a critical emergency that can lead to subtle data corruption in the worst case depending on which IDs get reused. Let’s look at these two in isolation.

Ensuring ANALYZE

Postgres with default settings will analyze any table where roughly 10% of its contents have changed, per the autovacuum_analyze_scale_factor setting. A cluster migration usually qualifies here due to the data copy volume, but background workers are limited and usually throttled, meaning the process will take time. This can be partially mitigated by making these changes to the postgresql.conf file to give the autovacuum process a bit more breathing room:

autovacuum_max_workers = 5
vacuum_cost_limit = 2000

The default settings are incredibly conservative to reduce impact on underlying storage throughput, but at the potential cost of maintenance falling behind write activity, or simply being slower than desired. Five workers with a shared cost limit of 2000 is 400 points per worker rather than roughly 66, making automated stat updates about six times faster, with more concurrency.

Additionally, Postgres provides a view of user tables so it’s possible to know the last time stats were collected. The easiest way to check for non-analyzed tables is with a query like this:

SELECT schemaname, relname, last_autoanalyze, last_analyze
  FROM pg_stat_user_tables
 WHERE coalesce(last_autoanalyze, last_analyze) IS NULL
    OR greatest(last_autoanalyze, last_analyze) < now() - INTERVAL '1 day';

This query returns any table which has either never been analyzed, or hasn’t been analyzed in the last 24 hours. These are important aspects for a migration where stats should be as recent as possible to ensure success when queries finally arrive. Any table identified here should be manually analyzed.
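The same catalog view can also generate the remedial statements directly, so the stragglers don’t have to be handled one at a time; feeding the output back to psql finishes the job:

```sql
-- Emit an ANALYZE statement for every table flagged by the check above;
-- run the resulting script against the upgraded cluster.
SELECT format('ANALYZE %I.%I;', schemaname, relname)
  FROM pg_stat_user_tables
 WHERE coalesce(last_autoanalyze, last_analyze) IS NULL
    OR greatest(last_autoanalyze, last_analyze) < now() - INTERVAL '1 day';
```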

A great trick to expedite mostly-fresh statistics is one used by pg_upgrade: set default_statistics_target to a low value, gather sparse statistics with a quick ANALYZE, then increase the target in increments until it reaches the desired level. Something like this:

SET default_statistics_target TO 1;
ANALYZE;
SET default_statistics_target TO 10;
ANALYZE;
SET default_statistics_target TO 100;
ANALYZE;
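In fact, pg_upgrade’s post-upgrade advice scripts this same staged approach through the vacuumdb utility:

```shell
# Runs ANALYZE over every database three times, at statistics targets
# 1, 10, and the configured default, so rough stats arrive quickly.
vacuumdb --all --analyze-in-stages
```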

Still, perhaps the best way to proceed is to wait. Their first mistake here may have been attempting to perform the entire migration in a single day. Depending on the size of the cluster, initializing the logical subscription alone could take hours or even days. There’s no rush here: once the initial copy is complete, all new data streams to the upgraded cluster.

This could have been done days or even weeks in advance, allowing sufficient time for supplementary verification of the upgrade cluster, or to catch potential issues before the transition date. New data would continue to stream from the old cluster to the new one, and logical replication is based on logical WAL decoding, so adds very little overhead while the state of the cluster is in flux.
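While waiting, the provider’s replication slots make it easy to confirm the subscriber is keeping up before scheduling the switchover:

```sql
-- Run on the old (provider) cluster: bytes of WAL the subscriber has
-- yet to confirm, per logical replication slot.
SELECT slot_name,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS replication_lag
  FROM pg_replication_slots
 WHERE slot_type = 'logical';
```

A lag that holds steady near zero is a good sign the transition window can be short.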

Synchronizing sequences

Perhaps the most annoying omission in the Postgres logical replication engine is sequence handling. If we look at them naively, each sequence can be seen as a single table with one row and one column, so why not copy it? Part of the issue is that sequences transcend transactions, are cached, and are used simultaneously across all sessions. What exactly would we sync, and when? Each invocation of the nextval() function? For which session? If sequence caches are greater than one, do we copy the whole cache block? When do we work that into the logical replication stream?

There are a lot of questions here, and as of yet, few answers. The pglogical extension solved the problem by periodically checking the highest value on the provider system and setting specified sequences on the subscriber to a value 1000 higher. But this is not a good solution so long as the provider is handling writes, as the throughput of new rows could exceed the interval where the subscriber sequences are updated. So even with pglogical, we normally recommend disabling writes, calling the pglogical.synchronize_sequence() function on all sequences for one last forced copy, and then switching to the new cluster.
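With pglogical, that final forced copy can be driven from the system catalogs. This sketch assumes every user sequence in the database is being replicated:

```sql
-- With provider writes disabled, push one last value for each sequence.
-- pglogical.synchronize_sequence() takes the sequence's regclass.
SELECT pglogical.synchronize_sequence(c.oid::regclass)
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relkind = 'S'
   AND n.nspname NOT IN ('pg_catalog', 'information_schema');
```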

The safest way to proceed is to simply copy all sequences as part of the upgrade and forget about using any clever automatic copying system at all. We can even generate a script we can run on the subscriber once it’s ready to take over:

COPY (
    SELECT format('SELECT setval(%L, %s);',
           quote_ident(schemaname) || '.' || quote_ident(sequencename),
           coalesce(last_value + 100, 1))
      FROM pg_sequences
) TO '/tmp/sync_sequences.sql';

Then we can simply execute sync_sequences.sql on the subscriber right before reactivating writes on the upgraded cluster, and everything would be ready. We even worked in a comfortable buffer of 100 IDs to clearly delineate the shift between the clusters.
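Replaying the script is then a one-liner; the host and database names here are placeholders:

```shell
# Run against the upgraded cluster just before re-enabling writes.
psql -h new-cluster.example.com -d appdb -f /tmp/sync_sequences.sql
```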

The waiting game

Probably the worst mistake RevenueCat made was through inaction. Note that their migration was from Aurora Postgres version 10 to version 14. Postgres releases a major version every year, which makes its way into Amazon’s Aurora product in relatively short order. This means the upgrade had to cover a four-year span of updates. If upgrading had been a practiced, yearly cadence, their already impressive playbook would have been more thoroughly battle-tested.

There’s a very real assumed risk by falling so far behind. Postgres only patches major versions for five years, and Amazon seems to amplify this restriction by removing old versions from their catalog. A quick check of the us-east-1 region shows that Aurora supports versions 11.9 to 15.2, with no mention of 10.x at all. This is because support for 10.x ended on January 31st, 2023, roughly two months after RevenueCat completed their upgrade.

If there had been some problem which prevented the upgrade, one that caused a delay of a few more months, Aurora would have forced an in-place upgrade with their automated tooling. This tooling is essentially a wrapper for pg_upgrade, and requires a full outage to complete. It also requires a manual ANALYZE as part of this process, but Amazon makes this clear in their documentation, so it wouldn’t have been a surprise. And an in-place upgrade would have faithfully preserved sequence positions. Depending on how quickly snapshots complete, such an upgrade may have actually resulted in less downtime.

It’s possible that some of the migration issues were caused by a last-minute rush to beat the clock, so to speak. Maybe the procedure was mostly ready, the dry runs seemed sufficient, and they couldn’t risk waiting any longer. It’s entirely possible the end-of-life for version 10 forced their hand, given there was already a four-year delay. There’s no real way to know for sure, and we don’t really need to speculate. Even if it didn’t happen to RevenueCat this way, we’ve seen examples of exactly that story, and it always ends the same way.

Always learning

The fact that RevenueCat published a public post-mortem of their experience speaks well of them as a company. You can bet they learned a harsh lesson during that outage, and they clearly have a good idea how to avoid it in the future. Their new playbook probably has one or two big bold sections labeled “DON’T FORGET THIS” for their next foray into the unknown.

In the future though, we also hope they hand their playbook to some Postgres consultants for review. It doesn’t have to be EDB or any of the major players, just someone with a good track record who can act as an objective third party to catch things that they might have missed. There’s a good chance that may have been all that was necessary to avoid the costly outage RevenueCat experienced.

And finally, never put off database upgrades. We know things happen and maintenance can fall between the cracks, but databases are a fast-moving technology and vital to business operations. Five years may seem like a long time, but it goes by quicker than you think. Plan for an annual or biennial upgrade at minimum. Even major version upgrades should be so routine that they’re considered boring. This is especially true when relying on a cloud provider who may retire old versions outright and force your hand.

And to RevenueCat, we just want to say you actually did alright. We’ve witnessed and experienced much more harrowing transitions that included terrifying events like corruption or transaction ID wraparound. Your story, as painful as it probably was at the time, is a relatively tame one. We’re just glad you chose Postgres, as is the rest of the community.

Power to Postgres!
