One of the most common complaints about Postgres, especially in high data volume production environments, concerns the necessity of vacuum maintenance. If we want to build a High Availability stack, this means keeping ahead of all types of maintenance so we are never surprised. Many of us have experienced the impotent despair when our desktop operating system forces a system update and disrupts our workflow. Similarly, imagine our production Postgres cluster shuts down and demands an immediate manual VACUUM to forestall XID wraparound. Or perhaps a table becomes so bloated that queries are orders of magnitude slower than they should be.
We covered several ways to prevent unexpected wraparound in our last foray into managing the global transaction ID (XID). Experienced Postgres admins were probably wondering why we didn't mention ways to manage Autovacuum itself, and they'd be right to wonder. Keeping up with this topic is really a two-phase process: preventing esoteric problems from stalling XID advancement, and ensuring VACUUM is properly configured for future needs.
This week we will be focusing on the latter.
Releasing the Throttle
The default settings for Postgres Autovacuum are extremely conservative in order to prevent excessive storage IO from disrupting standard operations. While this approach is commendable in most respects, even moderate activity seen in a production environment can easily overwhelm these defaults.
The concept of disallowing background activity from interrupting or slowing down user sessions is a sound one. So rather than crank our settings up to 11 and consider the problem solved, we benefit from a more subtle approach. Take for example the autovacuum_vacuum_cost_limit parameter.
While an automated background vacuum is operating on a table and its associated indexes, it uses a cost-based approach to determine how aggressive it can be. During every operation, a VACUUM will accumulate a score based on several heuristics:
- vacuum_cost_page_hit - The cost incurred when vacuuming a page in the shared buffer. Since this is the easiest action to perform, it’s used as the basis for all other costs, and is set to 1.
- vacuum_cost_page_miss - The cost incurred when vacuuming a page that must be retrieved from disk. The default considers this twice as expensive as a regular page from the shared buffer.
- vacuum_cost_page_dirty - The cost incurred when vacuum dirties a previously clean page, which must eventually be written back to disk. This is a much more expensive operation than even a page miss, and carries a correspondingly higher default score of 20.
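These heuristics are all exposed as ordinary settings, so we can confirm the values in effect on a given system straight from pg_settings:
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'vacuum_cost_page%'
OR name LIKE 'autovacuum_vacuum_cost%';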
Once the vacuum score reaches autovacuum_vacuum_cost_limit, it pauses for a duration defined by autovacuum_vacuum_cost_delay. If we check the Autovacuum cost limit on most Postgres systems it will show -1, indicating that vacuum_cost_limit provides the actual limit. The default for this parameter is a mere 200.
A Postgres page is 8KB, so if our database recently experienced a high volume of writes, dirtying a mere 10 pages (80KB) would be enough to cause Autovacuum to pause for up to 20ms. We say "up to" because the default for autovacuum_vacuum_cost_delay was 20ms until Postgres 12, when the community accepted the reality of modern storage performance and decreased the default delay to 2ms. Older systems are still common in the wild, and they're the ones most likely to need their Autovacuum settings examined and revised.
In any case, the lower the cost limit, the more frequently Autovacuum will incur the delay penalty. Postgres experts at EDB often recommend increasing autovacuum_vacuum_cost_limit to 2000 and similarly reducing autovacuum_vacuum_cost_delay to the new 2ms default on older systems. This allows Autovacuum to spend more time actually reaping dead tuples or freezing old ones, rather than pausing in deference to user session activity.
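As a minimal sketch, assuming superuser access, both parameters can be changed cluster-wide with ALTER SYSTEM followed by a configuration reload; neither requires a restart:
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
SELECT pg_reload_conf();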
The caveat here is to watch storage IO performance. If Autovacuum ends up being too aggressive or is adversely affecting client query performance, it may make sense to reduce the limit slightly. Most modern database hardware has IOPS to spare, and more aggressive settings can help avoid costly outages later.
Expanding the Workforce
Once the throttle is handled, we must consider the fact that our databases likely contain several tables; in some cases this can range into thousands or even tens of thousands. Meanwhile, a single Autovacuum worker can only process one table at a time. If we were to rely on a single worker, it could take hours or even days to vacuum such a database. So Postgres allows us to specify autovacuum_max_workers to increase the number of parallel workers performing this crucial task.
The default for this value is 3, and it may or may not make sense to increase it. An easy way to tell is to check the pg_stat_activity view and see how many Autovacuum workers are currently processing a table. For example:
SELECT pid, query_start, query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';
pid | query_start | query
-------+-------------------------------+-------------------------------------------
21936 | 2022-03-21 20:44:56.504021+00 | autovacuum: VACUUM public.pgbench_branches
16775 | 2022-03-21 20:44:53.851726+00 | autovacuum: VACUUM public.pgbench_accounts
Users of Postgres 9.6 and later can also do this with the pg_stat_progress_vacuum view without having to wade through user sessions:
SELECT pid, relid::REGCLASS AS table, phase
FROM pg_stat_progress_vacuum;
pid | table | phase
------+------------------+---------------
6589 | pgbench_accounts | scanning heap
6572 | pgbench_branches | scanning heap
In this case, we have a tiny toy database that doesn’t require more workers. But if we have several very large tables that can monopolize a worker for long durations, or thousands of smaller tables that overwhelm available channels through sheer numbers, maintenance will still fall behind. So it’s important to know how many of our available workers are busy at any given time. If there are ever long periods where all workers are constantly occupied, it’s likely time to increase their number.
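If we want to quantify this over time, a monitoring system could periodically sample the number of busy workers with a simple count:
SELECT count(*) AS busy_autovacuum_workers
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';
If that count regularly sits at autovacuum_max_workers, the pool is saturated and more workers are probably warranted.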
Keep in mind that all active workers evenly split the vacuum cost limit! If our newly adjusted cost limit is 2000 and we have five busy workers, each has an effective cost limit of 400. This is another reason to increase the cost limit, as the default settings allocate only about 67 (200/3) points per worker before each must pause.
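As an illustrative sketch (the specific values here are arbitrary), raising the worker count while scaling the shared cost limit to preserve the per-worker budget might look like this; note that autovacuum_max_workers only takes effect after a server restart:
ALTER SYSTEM SET autovacuum_max_workers = 6;           -- requires a restart
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 4000;  -- split across all active workers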
Factoring in Thresholds
The last tweaks we can make to encourage better maintenance involve a few parameters dedicated to vacuum frequency. The first of these is autovacuum_vacuum_scale_factor, which determines what fraction of a table must change before launching a vacuum process to reap dead tuples.
The default for autovacuum_vacuum_scale_factor is 0.2, which equates to 20% of the table contents. This may seem excessive at first, but Postgres MVCC keeps old row versions inline with new ones. If we allow roughly 20% of a table to consist of reusable space, new row versions can fill that space rather than forcing the table to allocate new storage. This allows a table to remain in equilibrium so long as less than 20% of the total contents change between vacuum cycles.
But there's a hidden complication here for tables with vast row counts. If a table holds one billion rows, for example, over 200-million of them must change before Autovacuum starts reclaiming dead tuples. The effect is amplified the larger a table becomes, and can leave hundreds of millions of dead rows lingering between vacuums. Despite this, we do not recommend changing the default in this instance; there's a better solution.
The related autovacuum_vacuum_threshold parameter specifies the number of rows which must change before Autovacuum will execute, and defaults to 50. Interestingly, this parameter is additive with the scale factor: a table with one million rows will only be vacuumed after DML activity modifies at least 200,050 of them.
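Putting the two numbers together, we can estimate how close each table is to its next vacuum under the global defaults. This rough sketch ignores any per-table overrides:
SELECT relname, n_dead_tup,
       round(current_setting('autovacuum_vacuum_scale_factor')::numeric * n_live_tup)
       + current_setting('autovacuum_vacuum_threshold')::int AS vacuum_trigger
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;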
Some sneaky DBAs thought about this for a moment and then took advantage of the fact that certain configuration parameters can be applied directly to tables. Going back to our huge pgbench_accounts table, we could adjust Autovacuum to completely ignore the table's size and always trigger once 50,000 rows change:
ALTER TABLE pgbench_accounts
SET (autovacuum_vacuum_scale_factor = 0,
autovacuum_vacuum_threshold = 50000);
If we know the row volume of certain batch operations, or simply want more frequent vacuums on a certain table, this is a simple way to exercise direct control. This trick also works with modifying ANALYZE frequency. Thus, if we have a TB-scale table that introduces 200k rows per day, we can ensure fresh daily statistics by changing the defaults to specifically address that scenario:
ALTER TABLE pgbench_accounts
SET (autovacuum_vacuum_scale_factor = 0,
autovacuum_vacuum_threshold = 200000,
autovacuum_analyze_scale_factor = 0,
autovacuum_analyze_threshold = 200000);
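Overrides applied this way are stored with the table itself, so we can review which tables carry them by checking pg_class:
SELECT relname, reloptions
FROM pg_class
WHERE reloptions IS NOT NULL AND relkind = 'r'
ORDER BY relname;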
The default for autovacuum_analyze_scale_factor is 10%, and while that's a smaller proportion than the vacuum scale factor requires, it can still easily prevent timely statistics collection on cumulative data sets. If two years of data amounts to one billion rows, that shouldn't mean we need roughly 75 days of modifications before we update our statistics!
Ahead of the Pack
There are two final knobs we’ll be covering in this article which bring us back to the topic of XID wraparound. We’re bringing this up near the end in order to emphasize the abundance of caution necessary when considering these changes.
In extremely high-volume OLTP databases, there are scenarios where Autovacuum may be properly configured and well tuned, yet still disruptive to ongoing client activity. This happens when, despite all of our efforts, Autovacuum simply triggers too often for our transaction volume because we run afoul of autovacuum_freeze_max_age or autovacuum_multixact_freeze_max_age.
The autovacuum_freeze_max_age parameter determines how old a table's oldest transaction ID may become before Autovacuum performs a freeze operation to mark applicable rows as permanently visible. The default is a paltry 200-million transactions, for the sake of aggressively avoiding the 2-billion absolute limit. But what if we have an OLTP system that exceeds 200-million transactions several times per day? I've personally managed a database that sustained over 30k TPS, or just under 1-billion transactions during a busy 8-hour work day.
If we have one-thousand tables, and each of them must be frozen five times per day simply to remain under the 200-million transaction horizon, the resulting activity could easily saturate the underlying storage system. In limited circumstances such as this, we actually would recommend increasing autovacuum_freeze_max_age to better model the actual transaction activity on the system. We wouldn’t suggest going over one-billion here simply due to XID wraparound dangers, but even this amount will make a substantial difference.
A higher maximum freeze age also prevents table “bunching”. If transactions are not evenly distributed among our one-thousand tables, some tables will accumulate transaction age faster than others. With a wider distribution, these tables are more likely to trigger a freeze vacuum in isolation of the others. This results in shorter waiting lists for our limited crew of Autovacuum workers.
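To see which tables are accumulating age the fastest, and how close each is to autovacuum_freeze_max_age, we can check relfrozenxid directly:
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;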
The autovacuum_multixact_freeze_max_age parameter is similar. Since multiple transactions can lock a single row, Postgres assigns a "multixact ID" to represent that shared lock in the row header. While not the global transaction ID, a multixact ID is only 32 bits and thus also requires regular maintenance via freezing. The default limit of 400-million is quite a bit higher than the default for XIDs, but we still don't want to be too permissive here. It makes sense to increase this along with autovacuum_freeze_max_age, but try not to exceed 1.2-billion: the suggested maximum freeze age plus an additional 200-million to preserve the default distance between the two parameters.
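A minimal sketch of those two ceilings as cluster-wide settings; both parameters only take effect after a server restart:
ALTER SYSTEM SET autovacuum_freeze_max_age = 1000000000;            -- 1-billion
ALTER SYSTEM SET autovacuum_multixact_freeze_max_age = 1200000000;  -- 1.2-billion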
We didn’t mention this previously, but Postgres retains a visibility map for frozen data pages. In the event a page is already frozen, Postgres excludes it from further processing. This means that more frequent vacuums are often better. Even aggressive vacuums that must visit every non-frozen page do far less work if most pages are already frozen.
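For the curious, the pg_visibility extension (assuming it is installed or installable) can report how much of a table the visibility map already considers all-frozen:
CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT * FROM pg_visibility_map_summary('pgbench_accounts');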
Wrapping Up
There's a lot to consider here; this is hardly a trivial topic. Running a production Postgres instance requires close management and monitoring of regular activity, including expected maintenance. This is especially true so long as the global transaction ID counter is 32 bits. Yet even without that limitation, the existing Postgres storage engine means curating row turnover through judicious Autovacuum tuning as well.
For those lucky few who can take advantage of BDR or another Multi-Master replication technology, there’s another option: write target toggling. BDR clusters will commonly dictate two nodes per region, with one node acting as a logical alternate write target. Since replication is logical rather than physical, it’s possible to levy much more aggressive maintenance on the “offline” node, up to and including VACUUM FULL. Once the harsher maintenance completes, simply swap the write targets and repeat the process.
High Availability is more than dictating the correct architecture for the environment and implementing an adequate quorum-driven node management layer. Basic care and feeding are ever present, and falling behind may result in an unresponsive or offline node. If I'd retained the default Autovacuum parameters in that 30k TPS system I managed, not even a Multi-Master HA architecture would have prevented Autovacuum from falling hopelessly behind, encouraging table bloat and terrible table statistics along the way.
The Postgres documentation goes into much greater detail regarding the subtleties of regular vacuum maintenance, but tackling that immediately without a basic understanding can be overwhelming. Hopefully this discussion has provided better insight into how and why Autovacuum works the way it does, and how a few basic tweaks can result in significant improvements in the overall behavior of your Postgres cluster.