One of the more prominent limitations of the Postgres database storage engine is the fact that it utilizes a 32-bit integer to track the current transaction identifier (XID). For the more uninformed Postgres users, this means there’s a hard limit of roughly 2-billion transactions hard-coded into the system before maintenance is necessary. Failure to perform this maintenance means that the global transaction ID wraps around back to zero, essentially invalidating any past transactions as the counter continues to advance.
When a Postgres cluster outage hits the news, or a critic outlines their dislikes regarding the database engine, there’s a good chance Transaction ID wraparound will make a prominent appearance. This isn’t because wraparounds are a common occurrence, but because of the disproportionate amount of problems that result when encountered.
Let’s talk about how wraparound happens, and how we can avoid it using proactive Postgres configuration settings and appropriate monitoring.
You look like you're havin' fun
The opening paragraph to this article is something of a gross oversimplification at how the global transaction ID counter works. Postgres uses Multi-Version Concurrency Control (MVCC) to establish several guarantees. Postgres marks every row in a table with some metadata including the ID of the committing transaction. Updates don’t actually change row contents at all, but create a new row so old row versions are still available to ongoing transactions. The only modifications are made to row metadata when applicable.
As an example, consider this test table and contents:
CREATE TABLE xid_test ( test_id SERIAL PRIMARY KEY, content TEXT NOT NULL ); INSERT INTO xid_test (content) VALUES ('Hello world!'); SELECT xmin, xmax, * FROM xid_test; xmin | xmax | test_id | content ---------+------+---------+-------------- 2996470 | 0 | 1 | Hello world!
In this particular case, transaction 2996470 created the row we see right now. What happens if we execute an update in another session but refrain from committing our transaction? Let’s see:
-- Session 1 BEGIN; -- Session 2 BEGIN; UPDATE xid_test SET content = 'Goodbye world!' WHERE test_id = 1; -- Session 1 SELECT xmin, xmax, * FROM xid_test; xmin | xmax | test_id | content ---------+---------+---------+-------------- 2996470 | 3502545 | 1 | Hello world!
Since we started a transaction before the update statement, session 1 should not see data that exists in the future. The xmax column marks the transaction that deleted this row version. Once session 2 commits, our content becomes “Goodbye world!” and that new row version becomes the only visible copy.
Now imagine everything is fine and dandy, and we churn through a couple billion transactions since the last update above. Now the current XID is 2996470 again, and we have a bit of a problem. Transaction 3502545 hasn’t happened yet, so now we can see an old row version. This is the core of transaction ID wraparound, and why Postgres goes through great pains—including a forced database shutdown in extreme cases—to prevent it.
Now I... I got my sights on you
Transaction ID wraparound and MVCC is the reason Postgres needs regular VACUUM maintenance. When invoked, Postgres assigns VACUUM a transaction ID and it operates like any other transaction. Any rows with an xmax older than the XID of the vacuum are marked as reusable, preventing future XID conflicts those rows may have caused. But this doesn’t truly solve the problem.
What about current rows that were not deleted by a subsequent write operation and have an xmax of zero? If our transaction ID is lower than the current xmin as in a wraparound, those rows become invisible because they haven’t happened yet! There’s an operation called FREEZE which takes care of this too. If we examine the page header for the single row and subsequent update, we’ll see this:
SELECT lp, t_ctid, t_xmin, t_xmax, t_infomask & (256 | 512) = (256 | 512) AS frozen FROM heap_page_item_attrs(get_raw_page('xid_test', 0), 'xid_test'); lp | t_ctid | t_xmin | t_xmax | frozen ----+--------+---------+---------+-------- 1 | (0,2) | 2996470 | 3502545 | f 2 | (0,2) | 3502545 | 0 | f
The first value corresponds to our original row, and the second is our updated copy. The frozen bit mask is actually equal to the condition where a transaction is both committed (256) and invalid (512), which is impossible. This tells Postgres to ignore the xmin reading, essentially meaning this row will transcend the XID horizon no matter what its current value may be. Effectively, until another update modifies this row, it will persist forever and be totally immune to the effects of wraparound.
So let’s freeze our new row:
SET vacuum_freeze_min_age = 1000000; VACUUM FREEZE xid_test; SELECT lp, t_ctid, t_xmin, t_xmax, t_infomask & (256 | 512) = (256 | 512) AS frozen FROM heap_page_item_attrs(get_raw_page('xid_test', 0), 'xid_test'); lp | t_ctid | t_xmin | t_xmax | frozen ----+--------+---------+--------+-------- 1 | | | | 2 | (0,2) | 3502545 | 0 | t
As expected, the VACUUM made the original tuple reusable by new rows, but it also made the necessary header changes to mark the current row as frozen. Now this row is visible to all newly established transactions, no matter where the current XID may be.
There’s one important caveat to this statement: Postgres can’t vacuum or freeze rows that older transactions can still see! This is usually the underlying cause of XID wraparound. It’s a subtle failure of the maintenance mechanisms because they were prevented from performing this crucial task.
Watch out, here I come
So how do we find activity that can prevent VACUUM from doing its job? The fastest and easiest is to check at the database level.
SELECT datname AS database_name, age(datfrozenxid) AS age FROM pg_database; database_name | age ---------------+--------- postgres | 3501820 template1 | 3501820 template0 | 3501820
The age() function tells us the “true” age of a tuple as viewed by its distance from the XID horizon. Since there’s been no wraparound in this test instance, the age is basically just the current XID value of 3.5M. We can also see that the XID value is global across all databases in the cluster. This means each database can have a different age depending on when the tables inside it were last frozen.
It’s a good idea to track this age for each database in a monitoring dashboard, barring the template databases. If we set alerts or observe that the age has exceeded 1-billion, it’s time to start investigating further. There are basically three major elements that can prevent VACUUM maintenance on tables.
Prepared transactions are a rarely utilized functionality that is disabled by default, and are intended for advanced transaction managers to provide two-phase commit. We enabled them by setting max_prepared_transactions to a non-zero value in our test system for demonstration purposes:
SHOW max_prepared_transactions; -- This will usually be zero max_prepared_transactions --------------------------- 5
In the rare event a cluster does have prepared transactions enabled, it’s important to understand that these will persist through database restarts. Thus a database crash or even a full system reboot will not clear prepared transactions. It’s not uncommon for Autovacuum to fall weeks or even months behind, since it can’t clean up any rows newer than the oldest visible XID. In this case, the oldest prepared transaction ID is preventing system maintenance.
Luckily it’s easy to find these, as Postgres has a view which specifically tracks them:
SELECT * FROM pg_prepared_xacts; transaction | gid | prepared | owner | database -------------+------+-------------------------------+-------+---------- 3502548 | test | 2022-03-10 09:12:09.378407-06 | bones | postgres
What would have happened if we had a very active database and one billion transactions elapsed since we created and then abandoned this transaction? The only way to address these is to commit them, or roll them back individually:
-- Commit the prepared transaction COMMIT PREPARED 'test'; -- Roll back the prepared transaction ROLLBACK PREPARED 'test';
Since we can see the creation date of the transaction, this may influence whether we decide to commit or roll back. Once all prepared transactions are accounted for, there are a few other things we should check.
Long running transactions
Databases are heterogeneous resources, so it’s common for a mix of long and short transactions and batches to regularly intermingle. Sometimes an application is miscoded or hangs in the middle of batch processing. Such processes can no longer issue commands to the Postgres session, and if they had opened a transaction before encountering a problem, the transaction will also remain active. This can also happen for badly written queries that unexpectedly execute for several hours.
No matter what causes a transaction to last for several hours (or even days), the result is that VACUUM can’t maintain data beyond that transaction ID. Once again, the system catalog is our benefactor:
SELECT datname, pid, client_addr, now() - xact_start AS duration, now() - state_change AS time_idle FROM pg_stat_activity WHERE now() - state_change > INTERVAL '10 minutes' AND now() - xact_start > INTERVAL '10 minutes'; datname | pid | client_addr | duration | time_idle ----------+---------+-------------+-----------------+----------------- postgres | 3067701 | 10.4.4.17 | 01:29:32.727505 | 01:29:32.727469
This is a very simple query that identifies all transactions which have been running for at least 10 minutes without client feedback. In this case, we can see that a transaction has been running for over an hour, and hasn’t received any commands for that entire duration.
Of course, an ounce of prevention is worth a pound of cure. As of Postgres 9.6, it’s possible to prevent such idle (and presumably abandoned) transactions by telling Postgres to handle them automatically. For example, this would terminate any transactions greater than an hour:
idle_in_transaction_session_timeout = 3600
It’s probably a good idea to set this to a reasonable value in especially sensitive production systems. But keep long-running batches or late-night processing in mind! Audit these to determine the maximum expected run duration, and then add a comfortable padding to prevent terminating expected processes.
Hot standby feedback
A well designed Postgres database cluster contains at least one other replica node. Rather than let such servers languish, it’s extremely common to use them for read-only query activity. Physical Postgres replicas are effectively exact binary copies, meaning that any pages VACUUMed on the Primary also require modification on the replica.
The Postgres WAL Apply process on the replica accomplishes this by making changes that appear in the WAL stream. Now imagine a query is running on a replica, and sees a row that the WAL Apply process needs to remove. Postgres can either cancel the query, or wait for it to finish. Postgres can’t wait forever, because while it’s waiting, we’re accumulating replication lag. So understandably, Postgres will terminate the query. There are configuration parameters to control how long Postgres will wait before doing this, but are largely unused.
Why? The hot_standby_feedback configuration parameter takes a different approach, and is far more commonly deployed. Rather than terminating replica queries, it actually forwards row usage to the Primary. The Primary node will then refrain from VACUUMing rows that are being used by a replica. This essentially transforms long transactions on a replica into long transactions on the Primary as well.
This means we should closely monitor all replicas with hot standby feedback enabled for long-running transactions. It’s also a good idea to use the same idle transaction session timeout on all nodes. Since such transactions are effectively propagated to the Primary node, they can adversely affect the XID horizon across the whole cluster.
Just a little bit closer
Transaction ID wraparound sounds terrifying, but we have the tools at our disposal to ensure we’re immune. Regular maintenance is the key, and so long as the maintenance isn’t prevented by hidden pitfalls, there’s actually very little risk.
All of this assumes, of course, that Autovacuum is properly configured for the observed workload on the Primary system. The default settings are relatively conservative, so it’s possible that critical FREEZE maintenance slowly falls behind. In our next article, we’ll explore properly tuning all of the various Autovacuum knobs to ensure that doesn’t happen.
Until then, keep on safely spinning that XID ‘round.