MVCC and VACUUM

Robert M. Haas

December 19, 2017

Experienced PostgreSQL users and developers rattle off the terms “MVCC” and “VACUUM” as if everyone should know what they are and how they work, but in fact many people don’t. This blog post is my attempt to explain what MVCC is and why PostgreSQL uses it, what VACUUM is and how it works, and why we need VACUUM to implement MVCC. In addition, I’ll include a few useful links for further reading for those who may want to know more.

PostgreSQL is a relational database and therefore attempts to provide transactions which have the “ACID” properties: atomicity, consistency, isolation, and durability. Let’s ignore consistency and durability for the moment. Atomicity means that transactions are “all or nothing”; either the transaction commits and all of its changes take effect, or it aborts and none of the changes take effect. Isolation means that each transaction should be unaware that there are other transactions executing concurrently. For example, if transaction A updates two rows with a = 1 to have a = 2 and concurrent transaction B deletes all rows with a = 2, it should never happen that one of the two rows updated by A gets deleted and the other does not. Either A “happens first” and thus both rows are deleted, or B “happens first” and thus neither row is deleted. Taken together, this means that the database would like to give the user the impression that no concurrent changes are taking place while their transaction is in progress, and that when their transaction commits, all of the changes it made become instantaneously and simultaneously visible to everyone.

Relational databases in general - not PostgreSQL specifically - have settled on two ways of doing this. The first is two-phase locking: a transaction takes a shared lock on every bit of data it reads and an exclusive lock on every bit of data that it writes. When it commits, it releases those locks; if it aborts, it reverses the changes that it made and then releases locks. Because any modified data is protected by an exclusive lock, no concurrent transaction will see that change until it is committed. Meanwhile, readers can coexist, since their locks are all shared. Unfortunately, when data is both read and written, this approach has poor concurrency: readers and writers block each other, because the shared locks taken by readers conflict with the exclusive locks taken by writers. Locks are bad for concurrency, and two-phase locking requires exclusive locks (and shared locks, for full correctness) to be held for the entire duration of the transaction.

The second approach to providing transactions with atomicity and isolation is multi-version concurrency control (MVCC). The basic idea is simple: instead of locking a row that we want to update, let’s just create a new version of it which, initially, is visible only to the transaction which created it. Once the updating transaction commits, we’ll make the new row visible to all new transactions that start after that point, while existing transactions continue to see the old row. That way, every transaction sees what is in essence a consistent snapshot of the database. Finally, when there are no transactions remaining that can see the old version of the row, we can discard it. This approach has much better concurrency than two-phase locking, which is why it is used by many modern database systems. However, we’ve exchanged our concurrency problem for a space management problem. With two-phase locking, a row that gets updated can be updated in place; because the updating transaction holds an exclusive lock, nobody else can see the modifications until the lock is released. With MVCC, this won’t work, since we need to keep both versions at least for a short while (or perhaps for a long while, if there are long-running concurrent transactions). Either the old row version must be saved elsewhere, and the new version put in its place, or the old row version must be left untouched, and the new version stored elsewhere. This is an area where different database systems have adopted different approaches.

PostgreSQL’s approach to this problem is to store new row versions created by UPDATE in basically the same way that it would store a completely new row generated by an INSERT. Once an UPDATE commits, future SQL statements (or future transactions, depending on your choice of isolation level) will see the new row versions, and the old ones will be visible only to statements (or transactions) which are already running. Eventually, there will no longer be anything running which can see the old row versions, at which point they are “dead”. Similarly, when a DELETE commits, there will eventually be nothing running which can see the old row versions, at which point they are likewise “dead”. Similarly, if an INSERT aborts, the new row versions which it inserted go from being visible only to the inserting transaction to being visible to nobody at all, and are thus “dead”, and likewise when an UPDATE aborts, any new row versions which it created are “dead”. The Heroku Dev Center has a well-written article explaining some details of how this works under the hood.

And that brings us to the need for VACUUM. Any system which uses MVCC to implement transaction isolation must have some way of getting rid of dead row versions. If it did not, the number of dead row versions would grow without bound, and therefore the database size would grow without bound, which would be bad. At some point, we must get rid of the dead row versions so that the space can be reused. In PostgreSQL, this is VACUUM’s job. Some other systems use different cleanup mechanisms for different kinds of dead row versions; for example, dead row versions created by aborted transactions may be handled differently from dead row versions created by committed transactions. In PostgreSQL, we handle of these in the same way. It bears mentioning that in certain cases, dead row versions can be cleaned up using a page-at-a-time mechanism called HOT pruning (for technical details, see README.HOT). These cases are actually quite common; however, HOT pruning is a “best effort” strategy, and VACUUM is the backstop that makes sure the work gets done and completes whatever cleanup can’t be done by HOT pruning. (This brief explanation omits many interesting details, but this post is already getting quite long.)

So, here’s how VACUUM works. First, it scans every page in the table (also known as the “heap”) that might possibly contain dead row versions. A data structure called the visibility map keeps track of which pages have been modified since the last VACUUM; those that have not been may be skipped. As it does, it removes dead row versions from those pages and makes the space occupied by those tuples available for reuse. During this process, dead row versions are replaced with a sort of tombstone - in technical terms, the line pointer that pointed to the row version which was removed is marked LP_DEAD. If no dead tuples are found during this stage, VACUUM stops. Otherwise, it scans every index on the table and removes all index entries which point to the dead line pointers identified during the first phase. Once this phase is complete, it goes back to the table and removes the tombstones - in technical terms, the LP_DEAD line pointers are marked LP_UNUSED. Once this is done, those line pointers can be reused for new tuples. If those line pointers were reused before the index entries were removed, we would potentially end up with index entries that don’t match the row versions to which they point, which could cause queries to return wrong answers.

The previous paragraph used the term “line pointer”, but did not define it. Each page begins with an array of line pointers which identify the starting position of each tuple on the page, and the length of that tuple. There are a few special kinds of line pointers, including LP_UNUSED (which means that the array slot is available to be used for a new tuple), LP_DEAD (which, as noted above, means that the tuple has been removed from the page but the line pointer slot isn’t available to be reused yet), and LP_REDIRECT (which is related to HOT, but beyond the scope of this blog post).

In short, VACUUM scans the parts of the table that have changed since the last VACUUM, then scans the indexes in their entirety, then scans the changed parts of the table again. By doing so, it ensures that dead row versions and the index entries which referenced them are removed, and that the line pointers which referenced those dead row versions are made available for reuse.

Some final notes:

1. An ordinary VACUUM is very different from VACUUM FULL, which actually rewrites the whole table.

2. In addition to its function in removing dead tuples, VACUUM also prevents transaction ID wraparound, updates statistics used by the query planner, and updates the visibility map. Updating the visibility map is important not only to make future VACUUM operations more efficient, but also to make index-only scans work better. This is all covered in the documentation.

3. Generally, you don’t need to run VACUUM by hand, because the autovacuum daemon will arrange to run it automatically. However, there are definitely times when it’s useful to run VACUUM by hand. This can be a good idea, for example, after a bulk load, or during a low-usage time of day, to reduce the need for later vacuuming during high-usage periods.

Robert Haas is Vice President, Chief Database Architect, at EnterpriseDB.

This post originally appeared on Robert's personal blog.

Resource Feature Callout 1

MVCC and VACUUM

Robert M. Haas