VACUUM and ANALYZE are the two most important PostgreSQL database maintenance operations.
VACUUM recovers the space occupied by “dead tuples” in a table. A dead tuple is created when a record is deleted or updated (an update is effectively a delete followed by an insert). PostgreSQL doesn’t physically remove the old row version from the table; it marks it so that queries no longer return it. When a vacuum process runs, the space occupied by these dead tuples is marked as reusable for new tuples.
An “analyze” operation does what its name says: it analyzes the contents of a database’s tables and collects statistics about the distribution of values in each column of every table. The PostgreSQL query planner uses these statistics to find the best query plan. As rows are inserted, deleted, and updated in a database, the column statistics also change. ANALYZE, whether run manually by the DBA or automatically by the autovacuum daemon, keeps the statistics up to date.
Although they sound relatively straightforward, vacuuming and analyzing are two complex processes behind the scenes. Fortunately, DBAs don’t have to worry much about their internals. However, they are often unsure when to run these processes manually and how to set optimal values for the configuration parameters.
In this article, we will share a few best practices for VACUUM and ANALYZE.
Tip 1: Don’t Run Manual VACUUM or ANALYZE Without Reason
PostgreSQL vacuuming (autovacuum or manual vacuum) minimizes table bloat and prevents transaction ID wraparound. A plain vacuum, including autovacuum, makes the space held by dead tuples reusable within the table but generally does not return it to the operating system. Running VACUUM FULL does. VACUUM FULL has performance implications, though. The target table is exclusively locked during the operation, preventing even reads on the table. The process also makes a full copy of the table, which requires extra disk space while it runs. We recommend not running VACUUM FULL unless there is a very high percentage of bloat and queries are suffering badly. We also recommend scheduling it for periods of lowest database activity.
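As a rough illustration of the point above, the sketch below measures a table's on-disk size before and after a VACUUM FULL. The table name orders is hypothetical, and the statements assume a maintenance window, since the table is unavailable while the command runs.

```sql
-- "orders" is a hypothetical table name used only for illustration.
-- Size of the table including its indexes and TOAST data, before the rewrite.
SELECT pg_size_pretty(pg_total_relation_size('orders')) AS size_before;

-- Rewrites the table into a new file; takes an exclusive lock for the duration
-- and temporarily needs disk space for the second copy.
VACUUM (FULL, VERBOSE) orders;

-- Size after the rewrite; the difference is the space returned to the OS.
SELECT pg_size_pretty(pg_total_relation_size('orders')) AS size_after;
```

If the two sizes are close, the table was not bloated enough to justify the lock.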
It’s also a best practice not to run manual vacuums too often on the entire database; the target database may already be optimally vacuumed by the autovacuum process. In that case, a manual vacuum may not remove any dead tuples yet still cause unnecessary I/O load or CPU spikes. If necessary, manual vacuums should only be run on a table-by-table basis when there’s a need for it, such as a low ratio of live rows to dead rows or large gaps between autovacuums. Manual vacuums should also be run when user activity is minimal.
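One way to decide which tables might need a manual vacuum is to look at the statistics PostgreSQL already keeps in pg_stat_user_tables. The query below is a minimal sketch; the 20-row limit and the table name orders are arbitrary choices for illustration.

```sql
-- Tables with the most dead rows, the share of dead rows, and the last vacuum times.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Vacuum a single table rather than the whole database.
-- "orders" is a hypothetical table name.
VACUUM (VERBOSE) orders;
```

These counters are estimates maintained by the statistics collector, so they are best read as a guide rather than exact figures.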
Autovacuum also keeps a table’s data distribution statistics up to date by running ANALYZE automatically once enough rows have changed. A manual ANALYZE recomputes the same statistics from a fresh sample of the table. Recomputing statistics that a regular autovacuum has already refreshed puts unnecessary pressure on system resources.
The one time you should run ANALYZE manually is immediately after bulk loading data into the target table. A large number of new rows (even a few hundred) in an existing table can significantly skew its column data distribution and leave the existing column statistics out of date. When the query optimizer relies on such statistics, queries can become very slow. In these cases, running ANALYZE immediately after the data load to rebuild the statistics is a better option than waiting for autovacuum to kick in.
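A bulk load followed by an immediate ANALYZE might look like the sketch below. The table name sales and the file path are hypothetical, and a server-side COPY from a file needs the appropriate privileges; psql's \copy is an alternative for client-side files.

```sql
-- Hypothetical bulk load into an existing table.
COPY sales FROM '/path/to/sales.csv' WITH (FORMAT csv, HEADER true);

-- Refresh planner statistics for just that table right away.
ANALYZE VERBOSE sales;

-- Confirm the statistics are fresh: last_analyze should be recent and
-- n_mod_since_analyze should be back near zero.
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'sales';
```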
Tip 2: Fine-tune Autovacuum Threshold
It’s essential to check or tune the autovacuum and analyze configuration parameters in the postgresql.conf file or in individual table properties to strike a balance between autovacuum and performance gain.
PostgreSQL uses two configuration parameters to decide when to kick off an autovacuum:
- autovacuum_vacuum_threshold: this has a default value of 50
- autovacuum_vacuum_scale_factor: this has a default value of 0.2
Together, these parameters tell PostgreSQL to start an autovacuum when the number of dead rows in a table exceeds the number of rows in that table multiplied by the scale factor, plus the vacuum threshold. In other words, PostgreSQL will start autovacuum on a table when:
pg_stat_user_tables.n_dead_tup > (pg_class.reltuples x autovacuum_vacuum_scale_factor) + autovacuum_vacuum_threshold
For small to medium-sized tables, this may be sufficient. For example, in a table with 10,000 rows, the number of dead rows has to exceed 2,050 ((10,000 x 0.2) + 50) before an autovacuum kicks off.
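The same arithmetic can be applied to every table at once. The query below is a sketch that assumes the server-wide settings apply to all tables; it ignores any per-table overrides stored in pg_class.reloptions.

```sql
-- Dead rows each table needs before autovacuum triggers, using the global settings.
-- reltuples is an estimate and can be -1 for tables that were never analyzed,
-- so it is clamped at zero here.
SELECT s.relname,
       s.n_dead_tup,
       GREATEST(c.reltuples, 0)::bigint AS estimated_rows,
       (current_setting('autovacuum_vacuum_threshold')::integer
        + current_setting('autovacuum_vacuum_scale_factor')::float8
          * GREATEST(c.reltuples, 0))::bigint AS dead_rows_needed
FROM pg_stat_user_tables AS s
JOIN pg_class AS c ON c.oid = s.relid
ORDER BY s.n_dead_tup DESC;
```

Tables where n_dead_tup sits far below dead_rows_needed for long periods are the ones most likely to benefit from a lower threshold.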
Not every table in a database experiences the same rate of data modification. Usually, a few large tables experience frequent data modifications and, as a result, accumulate more dead rows. The default values may not work for such tables. For example, with the default values, a table with 1 million rows needs more than 200,050 dead rows before an autovacuum starts ((1,000,000 x 0.2) + 50). This can mean longer gaps between autovacuums, increasingly long autovacuum times, and, worse, autovacuum not running at all if active transactions on the table are locking it.
Therefore, the goal should be to set these thresholds to values that let autovacuum run at regular intervals without taking a long time (and affecting user sessions), while keeping the number of dead rows relatively low.
One approach is to use one or the other parameter. So, if we set autovacuum_vacuum_scale_factor to 0 and instead set autovacuum_vacuum_threshold to, say, 5,000, a table will be autovacuumed when its number of dead rows is more than 5,000.
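Applied to a single table, the approach could look like the following sketch. The table name orders is hypothetical, and 5,000 is simply the figure used in the example above, not a recommended value.

```sql
-- Use a fixed dead-row threshold for one table instead of a percentage.
-- "orders" is a hypothetical table name.
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0,
    autovacuum_vacuum_threshold = 5000
);

-- Check which per-table overrides are in effect.
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'orders';

-- Go back to the server-wide settings if the override doesn't help.
ALTER TABLE orders RESET (autovacuum_vacuum_scale_factor, autovacuum_vacuum_threshold);
```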
Tip 3: Fine-tune Autoanalyze Threshold
Similar to autovacuum, autoanalyze uses two parameters to decide when the autovacuum daemon will trigger an analyze:
- autovacuum_analyze_threshold: this has a default value of 50
- autovacuum_analyze_scale_factor: this has a default value of 0.1
As with autovacuum, the autovacuum_analyze_threshold parameter can be set to a value that dictates the number of inserted, deleted, or updated tuples in a table before an autoanalyze starts. We recommend setting this parameter separately on large, high-transaction tables. The table-level setting overrides the postgresql.conf value.
The code snippet below shows the SQL syntax for modifying the autovacuum_analyze_threshold setting for a table.
ALTER TABLE <table_name>
SET (autovacuum_analyze_threshold = <threshold_rows>);
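To see which tables are candidates for a lower analyze threshold, it can help to compare the changes accumulated since the last analyze with the point at which autoanalyze would trigger. This sketch assumes the server-wide settings and ignores per-table overrides.

```sql
-- Row changes since the last analyze versus the number needed to trigger autoanalyze.
-- reltuples is an estimate and may be -1 for never-analyzed tables, hence GREATEST.
SELECT s.relname,
       s.n_mod_since_analyze,
       (current_setting('autovacuum_analyze_threshold')::integer
        + current_setting('autovacuum_analyze_scale_factor')::float8
          * GREATEST(c.reltuples, 0))::bigint AS changes_needed,
       s.last_autoanalyze
FROM pg_stat_user_tables AS s
JOIN pg_class AS c ON c.oid = s.relid
ORDER BY s.n_mod_since_analyze DESC;
```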
Tip 4: Fine-tune Autovacuum Workers
Another parameter often overlooked by DBAs is autovacuum_max_workers, which has a default value of 3. Autovacuum is not a single process, but a number of individual worker processes running in parallel. The reason for specifying multiple workers is to ensure that vacuuming large tables doesn’t hold up the vacuuming of smaller tables and user sessions. The autovacuum_max_workers parameter sets the maximum number of autovacuum worker processes PostgreSQL may run at the same time to do the cleanup.
A common practice among PostgreSQL DBAs is to increase the maximum number of workers in the hope that it will speed up autovacuum. This doesn’t work because all the workers share the same cost limit, autovacuum_vacuum_cost_limit. Its default of -1 means the value of vacuum_cost_limit is used, which defaults to 200. The limit is divided among the running workers, so when all of them are busy each one gets:
individual worker’s cost_limit = autovacuum_vacuum_cost_limit / autovacuum_max_workers
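The division above can be checked against a running server. The query below is a sketch of the worst case where every worker is busy; it treats an autovacuum_vacuum_cost_limit of -1 as "use vacuum_cost_limit", which is how PostgreSQL interprets that value.

```sql
-- Approximate cost budget per autovacuum worker when all workers are running.
WITH settings AS (
    SELECT current_setting('autovacuum_max_workers')::integer       AS max_workers,
           current_setting('autovacuum_vacuum_cost_limit')::integer AS av_limit,
           current_setting('vacuum_cost_limit')::integer            AS vac_limit
)
SELECT max_workers,
       CASE WHEN av_limit = -1 THEN vac_limit ELSE av_limit END AS effective_cost_limit,
       (CASE WHEN av_limit = -1 THEN vac_limit ELSE av_limit END) / max_workers
           AS per_worker_if_all_busy   -- integer division
FROM settings;
```

With a limit of 200 and 3 workers, each worker gets a budget of 66 before it has to pause.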
The cost of work done by an autovacuum worker is calculated using three parameters:
- vacuum_cost_page_hit: this has a default value of 1
- vacuum_cost_page_miss: this has a default value of 10 (2 since PostgreSQL 14)
- vacuum_cost_page_dirty: this has a default value of 20
What these parameters mean is this:
- When a vacuum worker finds the data page that it’s supposed to clean in shared buffers, the cost is vacuum_cost_page_hit.
- If the data page is not in shared buffers and has to be read from the OS cache or from disk, the cost is vacuum_cost_page_miss.
- If the page has to be marked dirty because the worker removed dead rows from it, the cost is vacuum_cost_page_dirty.
An increased number of workers lowers the cost limit for each one. With a lower cost limit, each worker reaches its threshold sooner and goes to sleep more often, so the whole vacuum process runs slower. We recommend first increasing autovacuum_vacuum_cost_limit to a higher value, like 2000, and then adjusting the maximum number of workers.
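A server-wide change along these lines could be made with ALTER SYSTEM, as sketched below. The worker count of 4 is an arbitrary illustration, and the statements assume superuser privileges and must run outside a transaction block.

```sql
-- Raise the shared cost budget first; this setting takes effect on a config reload.
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
SELECT pg_reload_conf();
SHOW autovacuum_vacuum_cost_limit;

-- Only then consider more workers. The value 4 is a hypothetical choice,
-- and this parameter takes effect only after a server restart.
ALTER SYSTEM SET autovacuum_max_workers = 4;
```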
A better way is to tune these parameters for individual tables only when necessary. For example, if the autovacuum of a large transactional table is taking too long, the table may be temporarily configured to use its own vacuum cost limit and cost delays. The cost limit and delay will override the system-wide values set in postgresql.conf.
The code snippet below shows how to configure individual tables.
ALTER TABLE <table_name> SET (autovacuum_vacuum_cost_limit = <large_value>);
ALTER TABLE <table_name> SET (autovacuum_vacuum_cost_delay = <lower_cost_delay>);
Raising the first parameter ensures the autovacuum worker assigned to the table performs more work before going to sleep. Lowering autovacuum_vacuum_cost_delay means the worker also sleeps for less time.
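With concrete values filled in, a temporary per-table override might look like this sketch. The table name orders and the values 1000 and 1 (milliseconds) are hypothetical; suitable numbers depend on the I/O capacity of the server.

```sql
-- Let autovacuum work harder on one busy table.
-- "orders" and both values are hypothetical.
ALTER TABLE orders SET (
    autovacuum_vacuum_cost_limit = 1000,
    autovacuum_vacuum_cost_delay = 1
);

-- Remove the override once the table has caught up, so the
-- server-wide settings apply again.
ALTER TABLE orders RESET (autovacuum_vacuum_cost_limit, autovacuum_vacuum_cost_delay);
```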
Final Thoughts
As you can see, changing configuration parameters for vacuum and analysis is straightforward, but it needs careful observation first. Every database is different in terms of its size, traffic pattern, and rate of transactions. We recommend DBAs start by gathering enough information about their database before changing the parameters or rolling out a manual vacuum/analyze regime. Such information could be:
- Number of rows in each table
- Number of dead tuples in each table
- The time of the last vacuum for each table
- The time of the last analyze for each table
- The rate of data insert/update/delete in each table
- The time taken by autovacuum for each table
- Warnings about tables not being vacuumed
- Current performance of most critical queries and the tables they access
- Performance of the same queries after a manual vacuum/analyze
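Much of the information in the list above is available from pg_stat_user_tables. The query below is a starting point rather than a complete monitoring solution; the logging change at the end is optional and assumes superuser privileges.

```sql
-- Row counts, dead tuples, modification counters, and last vacuum/analyze times per table.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       n_tup_ins,
       n_tup_upd,
       n_tup_del,
       last_vacuum,
       last_autovacuum,
       last_analyze,
       last_autoanalyze,
       autovacuum_count,
       autoanalyze_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- To see how long each autovacuum run takes, log every run.
-- A value of 0 logs all runs, which can be noisy on busy systems.
ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();
```

The insert, update, and delete counters are cumulative, so the rate of change comes from sampling them at intervals and comparing the values.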
From here, DBAs can select a few “pilot” tables to start optimizing. They can start changing the vacuum/analyze properties for the tables and check the performance. PostgreSQL is a smart database engine – DBAs will often find it’s probably best to let PostgreSQL do the vacuuming and analyzing rather than doing those manually.