Synchronous replication in Postgres is expensive. Very expensive. For any transaction to complete in a synchronous context, it must be acknowledged by at least one allocated replica system. Depending on the latency between these systems, the associated overhead costs are almost tangible. And each subsequent required response only increases the burden.
For a company with an extremely low Recovery Point Objective (the maximum amount of data loss that can be tolerated, usually expressed as a window of time), this introduces a somewhat perplexing conundrum. Do we incur the merciless wrath of physics, or widen our acceptable data loss window? Depending on the industry and various regulatory requirements, the latter may not even be legal, so what to do?
What if we’ve been framing this entire argument all wrong all these years?
CPU workload quirks
Usually one of the first complaints levied against synchronous replication concerns the overhead associated with remote transaction acknowledgment. Benchmarks will indeed bear this out, showing lower write transaction throughput from the same number of clients. Discussion over, right?
What are our server CPUs doing while waiting for some other semi-related task to complete? In a perfect world, they’ve been task-switched to another equally important job, and will resume when the blocking system call has completed. CPUs on servers dedicated to Postgres are likely to handle another Postgres session in this scenario, which can be either good or bad depending on the nature of the blocking call.
Imagine ten sessions each simultaneously invoke an extremely storage IO intensive query. If we have ten CPUs, then we’ll end up with ten idle CPUs waiting for the storage subsystem to return with the requested data. What if we only had four CPUs instead? Aside from some relatively inexpensive context switching, those four CPUs are still waiting on the same IO calls to return. Depending on the size of the resulting data sets, four CPUs may be entirely sufficient for that workload.
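To make that concrete, here’s a trivial shell sketch of the same idea: blocking waits cost essentially no CPU, so any number of them can overlap without requiring one processor each.

```shell
#!/usr/bin/env bash
# Ten simulated 0.2-second "IO waits" launched as concurrent
# background jobs. Total wall time stays near 0.2s rather than 2s,
# because a process blocked on a system call consumes no CPU while
# it waits.
start=$SECONDS
for i in $(seq 1 10); do
  sleep 0.2 &
done
wait
echo "elapsed: $(( SECONDS - start ))s"
```

The same reasoning applies whether the blocked call is a storage read or a replication acknowledgment.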
Consider that this also applies to synchronous replication, as it is just another external blocking call. Postgres commits a write to the WAL, that change enters the replication stream, and Postgres waits for some kind of acknowledgment from one or more replicas before returning control to the session. For the entire duration of that exchange, the CPU could be doing something else.
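Each stage of that exchange is also visible from the primary. On a reasonably modern Postgres (10 or later, which added the lag columns), a query along these lines shows how far each standby trails:

```sql
-- Run on the primary: one row per connected standby. The write,
-- flush, and replay lag columns correspond roughly to the
-- remote_write, on, and remote_apply guarantees respectively.
SELECT application_name, sync_state,
       write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;
```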
So why not let it?
Preparing the environment
We here at EDB openly wondered about this; unlike storage IO waits, transaction acks are non-destructive. After all, an overloaded storage subsystem only tends to become slower as the number of requests increases. Provided there isn’t much replication lag, transaction acknowledgements impose far lower overhead on a Postgres replica. But we wanted to test that to make sure, because if we were right, it could change everything. Or at least the perception of synchronous replication in Postgres.
To that end, we spun up a Postgres cluster in Amazon EC2 within a single availability zone. Each server consisted of an r5.2xlarge, which is equipped with 8 CPUs and 64GB of RAM. We assigned two 10,000 IOPS io2 EBS devices to each of these—one for data and one for WAL. Then we bootstrapped the cluster with pgbench at a scale of 10,000 for a total database size of about 150GB.
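For reference, that bootstrap amounts to a single pgbench invocation; the host and database names below are placeholders rather than our actual setup. The size estimate works out as commented, since one pgbench scale unit holds 100,000 account rows, roughly 15MB of data:

```shell
#!/usr/bin/env bash
# Hypothetical bootstrap command (host/db names illustrative):
#   pgbench -i -s 10000 -h pg-node-1 -U postgres benchdb
# Back-of-envelope size estimate: ~15MB per scale unit.
scale=10000
echo "approx. $(( scale * 15 / 1024 )) GB"
```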
Regarding the synchronous configuration itself, we set synchronous_standby_names to the names of both standby systems, like this:
synchronous_standby_names = '2 ("pg-node-2", "pg-node-3")'
This provided a kind of “worst case” scenario where a widely distributed cluster required full synchronous commit for maximum RPO guarantees. This is a situation where a company would rather stop handling transactions entirely than lose a single one.
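As an aside, this is the strictest of several possible topologies. Postgres 10 and later also accept a quorum form; had we wanted “any one of the two standbys” instead, the setting would look like this:

```
synchronous_standby_names = 'ANY 1 ("pg-node-2", "pg-node-3")'
```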
When everything was said and done, the cluster looked like this:
Wait, didn’t we just say it was a single availability zone? Latency between Amazon availability zones and regions isn’t exactly reliable, and we wanted consistent metrics. So we took advantage of a Linux utility called tc, a command-line tool that lets us manipulate the network traffic control system within the kernel.
Some Linux command incantations are more esoteric than others, and tc is particularly unintuitive, but we eventually applied this logic to each server in the cluster, using the first server as an example:
tc qdisc del dev ens5 root
tc qdisc add dev ens5 root handle 1: prio
tc qdisc add dev ens5 parent 1:3 handle 30: netem latency 5ms
tc filter add dev ens5 protocol ip parent 1:0 prio 3 u32 \
    match ip dst ip.of.server.2 flowid 1:3
tc filter add dev ens5 protocol ip parent 1:0 prio 3 u32 \
    match ip dst ip.of.server.3 flowid 1:3
That’s a lot of work to say “If you’re talking to another Postgres server, add 5ms of latency.” The rule is also reciprocal, so the total latency is 10ms between any two Postgres servers, but the pgbench system itself is unaffected.
Armed with 10ms of latency between each of our synchronous Postgres servers, it was time to move on to the benchmarks.
Synchronous write scaling
Postgres has five different possible settings for the synchronous_commit configuration parameter. We’re going to ignore ‘off’, because it should only be used in extremely limited circumstances. The remaining possibilities include:
- local - no remote guarantee at all, just local write and flush
- remote_write - transaction has been written to the remote node’s operating system, but not necessarily flushed to disk
- on - transaction has been written and flushed to the remote node’s disk
- remote_apply - transaction has been written, flushed, and applied on the remote node
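These guarantees also aren’t locked in cluster-wide: synchronous_commit can be overridden by any session or transaction. A sketch of a mixed workload, using a hypothetical low-value logging table, might exempt only its cheap writes:

```sql
BEGIN;
-- Only this transaction forgoes remote acknowledgment; the
-- cluster-wide default still applies to everything else.
SET LOCAL synchronous_commit = 'local';
INSERT INTO sensor_log (reading) VALUES (42);  -- hypothetical table
COMMIT;  -- returns as soon as the local WAL flush completes
```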
We decided to test each of these to see how much overhead write guarantees impose as they become more strict. The only other variable we changed for each benchmark was the amount of clients.
Let’s take a look at how everything turned out:
It’s no surprise that local commits were the fastest by quite a wide margin. However, consider what happens as the client count increases. At 40 clients, “remote_write” is only 60% as fast as “local”, but at 80 clients it’s 84%, and the gap continues to close as client count increases. Additionally, note that the “remote_write” and “on” synchronous modes eclipse the 40-client “local” commits starting at 120 clients. Query latency follows a very similar trajectory.
At 40 clients, “remote_write” adds 67% more write latency over “local” transactions (6.5ms), but at 80 clients the additional latency is only 19%, at around 3.3ms. Remember that our chosen hardware is only equipped with 8 total CPU threads, so accumulated latency is already steadily increasing due to context switching. Another 5ms of latency added to 33ms is proportionally less overhead than adding 6.5ms to 9.5ms.
The story actually gets a bit more interesting when we reduce the network latency to 3ms. We selected 10ms to represent a relatively large geographical distribution, such as three data centers separated by several hundred miles. Proximity can vary immensely, and availability zones within most cloud services are probably much closer than 10ms. There’s also the lingering question of round-trip times compounding the effective latency.
The resulting benchmark results appear to substantiate these speculations:
As expected, synchronous transactions are definitely slower than asynchronous, but the gap has closed considerably. Not only is “remote_write” operating at 92% of “local” at 40 clients, but it reaches parity at 80 clients rather than 120. Even “on” and “remote_apply” are within 7% of “local” performance at 80 clients, and reach near parity within 1% at 120 and even measure slightly higher at 160.
Given how much pgbench results can vary between runs, we attribute the faster performance of synchronous replication at 160 clients to statistical noise. It’s pretty interesting to see that 3ms of network latency is essentially reduced to statistical noise on a generally overloaded server, but 10ms maintains a steady gap after a certain point.
The query latency graph is equally revealing:
The results are very tightly grouped throughout all tests, with all results within 1ms of each other. It would appear that the server is so busy handling client requests that any latency introduced by synchronous replication has become background noise. If this is true of 3ms network latency, lower amounts would be equally difficult to distinguish from asynchronous performance.
Consider that we constructed a truly awful proof-of-concept here. We constrained ourselves to enforce fully synchronous writes across three servers with 10ms of latency between all network hops. Usually such tight restrictions wouldn’t be imposed over such a wide geographical area, but as we’ve seen here, it could be.
Even if geographical distribution isn’t a concern, it’s very common to recommend synchronous replicas remain as close as physically possible to the Primary node to avoid incurring excessive commit latency. This can mean within the same rack, the same virtual host, the same availability zone, and so on. Perhaps those restrictions may be loosened somewhat given our findings here.
This does all come with a certain caveat, however. Low client counts that don’t overwhelm the hardware will still perform optimally under asynchronous operation. This does imply it’s possible to under-provision hardware that’s meant to be used in a synchronous Postgres cluster and still achieve high transaction throughput.
There’s also a break-even point where saturating the hardware essentially obfuscates performance behavior. Consider our 10ms latency test at 80 clients where we reached peak transaction throughput on this hardware. We observed 17ms of latency for local commits at that client count, and 20ms for remote writes. How significant is this 3ms?
There’s still quite a bit of discussion to have if we also consider the “remote_apply” and “on” synchronous settings. They impose additional latency, and some may argue that “remote_write” is insufficient because it does not require a disk flush on the replicas. Higher connection counts do narrow this gap at high network latencies, but do not erase it.
Additionally, we’re very clearly overloading these servers. Higher transaction latencies do indeed help absorb network latency effects, but only because all queries operate with more latency in general. At 40 clients at 10ms network latency, a local commit happens at 9.5ms, and a remote one at 16ms. Yes, this gap shrinks at 80 clients, but at that point the local server is so backlogged that all transactions incur a 17ms penalty. We can only do this kind of client overloading if the application in question is not particularly latency sensitive.
In the end, perhaps the question becomes: what’s 5-10ms between friends? An idle CPU is a wasted CPU, and technologies like client database pools make it possible to multiplex client connections to specific counts at the database level. If tests show a particular piece of hardware operates best at 80 client connections, keeping CPUs saturated but not so overwhelmed that latency rises to compensate, we can make that happen.
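A minimal sketch of that last idea, using PgBouncer in transaction pooling mode with purely illustrative names and values, might cap the database at exactly that sweet spot:

```ini
; pgbouncer.ini (illustrative values)
[databases]
benchdb = host=pg-node-1 port=5432 dbname=benchdb

[pgbouncer]
listen_port = 6432
pool_mode = transaction
default_pool_size = 80
max_client_conn = 1000
```

Clients connect to the pooler as if it were Postgres, while the database itself never sees more than the configured 80 server connections.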
As we’ve seen here, it’s possible to factor in network latency as part of synchronous replication as well, and if we do it right, nobody will ever know.