The new Hot Standby feature in the upcoming PostgreSQL 9.0 allows running queries against standby nodes that previously did nothing but execute a recovery process. Two common expectations I’ve heard from users anticipating this feature are that it will allow either distributing short queries across both nodes, or running long reports against the standby without using resources on the master. Both are possible to do right now, but unless you understand the trade-offs involved in how Hot Standby works, you can run into some unanticipated behavior.
Standard Long-running Queries
One of the traditional problems in a database using MVCC, like PostgreSQL, is that a long-running query has to keep open a resource (referred to as a snapshot in the current Postgres implementation) to prevent the database from removing data the query needs to operate. For example, even if another client has deleted a row and committed, you can’t actually wipe out the physical disk blocks for that row just yet if an already running query still needs to see it. You have to wait until no open queries that expect that row to be visible remain.
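Here is a minimal sketch of that interaction using two sessions; the accounts table and its contents are made up for illustration:

```sql
-- Session 1 holds a snapshot open for the duration of its transaction:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM accounts;  -- snapshot taken here and held until COMMIT

-- Session 2 deletes a row and commits while that snapshot is still open:
DELETE FROM accounts WHERE id = 42;

-- Session 2 then tries to clean up; the dead row version can't be
-- reclaimed yet because session 1's snapshot might still need to see it:
VACUUM VERBOSE accounts;
-- Reports something along the lines of
-- "1 dead row versions cannot be removed yet"
```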
Hot Standby Limitations
If you have a long-running query you want Hot Standby to execute, there are a few kinds of conflicts that can happen while the recovery process is applying updates. These are described in detail in the Hot Standby documentation. Some of these conflicts will cause queries running on the standby to be canceled for reasons that might not be intuitively obvious:
- A HOT update or VACUUM-related cleanup record arrives that removes something the query expects to be visible
- A B-tree deletion record arrives
- There is a lock conflict between the query you’re running and the locks required for the update to be applied
The lock situation is difficult to deal with, but not very likely to come up in practice if you’re just running read-only queries on the standby, because those are isolated via MVCC. The other two are not hard to run into. The basic thing to understand is that any UPDATE or DELETE on the master can lead to interrupting any query on the standby; it doesn’t matter whether the changes even relate to what the query is doing.
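To make that concrete, here is roughly what such a cancellation looks like from the standby side; the query and table name are made up, and the exact message wording may differ:

```sql
-- A report running against the standby (hypothetical table):
SELECT customer_id, sum(amount) FROM orders GROUP BY customer_id;

-- If recovery replays a cleanup record that conflicts with this query's
-- snapshot, the standby cancels the query with an error along the lines of:
--   ERROR:  canceling statement due to conflict with recovery
--   DETAIL:  User query might have needed to see row versions that
--            must be removed.
-- The UPDATE, DELETE, or VACUUM activity on the master that produced the
-- conflicting record need not involve the orders table at all.
```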
Good, fast, cheap: pick two
Essentially, there are three things people might want to prioritize:
- Avoid master limiting: Allow xids and associated snapshots to advance unbounded on the master, so that VACUUM and similar cleanup isn’t held back by what the standby is doing
- Unlimited queries: Run queries on the standby for any arbitrary period of time
- Current recovery: Keep the recovery process on the standby up to date with what’s happening on the master, allowing fast fail-over for HA
In any situation with Hot Standby, it’s impossible to have all three at once; you can only pick your trade-off. The tunable parameters already available let you optimize in a few ways:
- Disabling all these delay/defer settings optimizes for always current recovery, but then you’ll discover queries are more likely to be canceled than you might expect.
- max_standby_delay optimizes for longer queries, at the expense of keeping recovery current. It delays applying updates to the standby once one arrives that will cause a problem (HOT update, VACUUM cleanup, B-tree deletion, etc.).
- vacuum_defer_cleanup_age and some snapshot hacks can introduce some master limiting to improve on the other two issues, but the UI for doing so is weak. vacuum_defer_cleanup_age is set in units of transaction IDs, so to turn the way people think about this problem (“defer cleanup by at least an hour so my reports will run”) into a value for this setting, you need some idea of the average xid churn on your system per unit of time, and xid consumption rate just isn’t a common or even reasonable thing to measure or predict. Alternately, you can open a snapshot on the primary before starting a long-running query on the standby; dblink is suggested in the Hot Standby documentation as a way to accomplish that (see the sketch after this list). Theoretically, a user-land daemon living on the primary could be written to work around this problem too (Simon has a basic design for one): you start a series of processes that each acquire a snapshot and then sleep for a period before releasing it, and by spacing out how long each one sleeps you ensure the xid horizon never advances too quickly on the master. It should already sound obvious how much of a terrible hack this would be.
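Here is a rough sketch of both ideas, the xid-churn estimate and the dblink snapshot hack, run as ad hoc SQL; the connection name, connection string, and the numbers shown are hypothetical:

```sql
-- Sizing vacuum_defer_cleanup_age: sample the xid counter on the master,
-- wait, and sample it again to estimate churn per hour.
SELECT txid_current();   -- say this returns 1000000
-- ...one hour later...
SELECT txid_current();   -- say this returns 1360000
-- Deferring cleanup by roughly an hour would then mean setting
--   vacuum_defer_cleanup_age = 360000
-- in postgresql.conf on the master.

-- The dblink hack, run from the standby before kicking off a long report.
-- 'primary_conn' and the connection string are made up for this example.
SELECT dblink_connect('primary_conn', 'host=master.example.com dbname=mydb');
SELECT dblink_exec('primary_conn',
    'BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ');
-- Run a query over the connection so a snapshot is actually taken and held:
SELECT * FROM dblink('primary_conn',
    'SELECT txid_current_snapshot()::text') AS t(snap text);

-- ...run the long report against the standby here...

-- Release the snapshot on the primary once the report finishes:
SELECT dblink_exec('primary_conn', 'COMMIT');
SELECT dblink_disconnect('primary_conn');
```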
Potential Improvements
The only one of these you can really do something about cleanly is tightening up and improving the UI for the master limiting. That turns this back into the traditional problem already present in the database: a long-running query holds open a snapshot (or at least limits the advance of the visibility-related transaction ID horizon) on the master, preventing the master from removing things needed for that query to complete. You might alternately think of this as an auto-tuning vacuum_defer_cleanup_age.
The question is how to make the primary respect the needs of long-running queries on the standby. This might be possible if more information about the transaction visibility requirements of the standby were shared with the master. That sort of exchange would really be more appropriate for the new Streaming Replication implementation to handle. The way a simple Hot Standby server is provisioned provides no feedback channel toward the master over which this data could be exchanged, beyond approaches like the already mentioned dblink hack.
With PostgreSQL 9.0 just reaching its fourth alpha release, there may still be time to see some improvements in this area before the 9.0 release. It would be nice to see Hot Standby and Streaming Replication truly integrated, in a way that accomplishes things neither is fully capable of doing on its own, before coding on this release completely freezes.