Session State Failover

October 23, 2017

On the server side, high availability means having the ability to quickly failover to standby hardware, hopefully with no data loss. Failover behavior on the client side is more nuanced. For example, when failover happens, what happens to connected clients? If no connection pooler is being used, clients connected to the failed machine will need to reconnect to the new server to continue their database work. Failover procedures should guarantee that all connections to the failed server are terminated and that no new connections happen. (Reconnecting to the failed server could produce incorrect results and lost changes.) If a client is connected to a standby that is promoted to primary, existing client connections and new connections are read/write.

Clients connect to the new primary via operating-system-specific methods, usually either virtual IP addresses (vip, good blog entry) or dns entries with a short time to live (ttl). This is normally accomplished using dedicated high-availability or clustering software. Postgres 10 will also allow multiple host names to be tried by clients.

For clients using a connection pooler, things are even more complicated. Logically, you would think that, since clients didn't connect directly to the failed server, they should be able to continue their queries in the same session uninterrupted. Generally, this is not the case.

First, the client might have been running a query during the failure — therefore, a query error code needs to be returned to the client so the client can retry the query (assuming query retry logic is built into the application). Second, the session might have been in the middle of a transaction block, meaning it has to resubmit all the queries that were part of the transaction (again, assuming the client has such retry logic). Third, several server configuration parameters might have been modified at the session level. The client might be depending on these settings for proper operation, so they would need to be re-modified in the new session. Fourth, there is additional session state that might need to be recreated. You can get an idea of possible session state modifications by reviewing the discard manual page.

If this sounds like a boat-load of challenges to allow clients using a connection pooler to continue uninterrupted, it is. None of the Postgres pooling tools allows uninterrupted failover. Pgpool automatically disconnects all clients and allows them to reconnect. (In pgpool 3.6+, clients connected to still-running standby servers are not affected.) PgBouncer does not detect failover.

To implement even a limited case where failover appears uninterrupted to clients, connection poolers would have to record session state changes, client queries, and even client results because many queries depend on the results of previous queries, e.g. select ... for update. Even if all this tracking was performed, changed query results would make uninterrupted behavior impossible. Given all this complexity, it isn't surprising that the standard approach to failover is for clients to be programmed to handle such cases by reconnecting, resetting their session state, and re-running any queries that future queries depend on.

Bruce Momjian is a Senior Database Architect at EnterpriseDB. 

This post originally appeared on Bruce's personal blog

Share this