The February 2023 release of EDB Postgres Distributed 5.0 brought a host of enhancements to EDB’s premier write-anywhere Postgres platform, including a major overhaul of how PGD handles client connection routing. It’s a significant departure from the previous design, so we want to explain why we made these changes and the improvements they bring.
Let’s get started!
High Availability Routing for Postgres
When we first introduced EDB Postgres Distributed, it went by the name of BDR, short for Bi-Directional Replication. Because writes can occur on any node, there’s an inherent risk of write conflicts if two nodes simultaneously modify the same row. PGD provides several mechanisms to address this, such as custom conflict handling, column-level conflict resolution, Conflict-free Replicated Data Types (CRDTs), and Eager resolution, among others. Yet the best way to handle conflicts is to avoid them entirely.
The easiest way to do this is simply to prevent writes on multiple PGD nodes at once. To accomplish this, we introduced HARP (High Availability Routing for Postgres), a Patroni-inspired design based on a Distributed Configuration Store (DCS). The idea was simple: each PGD node has a HARP Manager service that conveys node status to the DCS, and any HARP Router contacts the DCS to obtain the canonical Lead Master of the cluster. Clients connect to HARP Router, and all is right with the world.
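To make the division of labor concrete, here is a minimal Python sketch of the pattern. The key names, statuses, and election rule are purely illustrative, not HARP’s actual schema or lease mechanics: each Manager publishes its node’s health to a shared store, something elects a lead, and the Router reads a single key to decide where traffic goes.

```python
# Toy stand-in for a DCS such as etcd: a plain dict keyed by path.
# Key names and statuses are illustrative, not HARP's real schema.
dcs = {}
NODES = ("node1", "node2", "node3")

def manager_report(node, healthy):
    """HARP Manager's role: publish this node's status to the DCS."""
    dcs[f"/cluster/nodes/{node}/status"] = "up" if healthy else "down"

def elect_lead():
    """Pick a lead master from healthy nodes (a stand-in for the
    lease/lock-based election a real DCS would provide)."""
    healthy = sorted(n for n in NODES
                     if dcs.get(f"/cluster/nodes/{n}/status") == "up")
    dcs["/cluster/lead_master"] = healthy[0] if healthy else None

def router_target():
    """HARP Router's role: read the canonical lead master and route there."""
    return dcs.get("/cluster/lead_master")

manager_report("node1", True)
manager_report("node2", True)
manager_report("node3", False)
elect_lead()
print(router_target())  # node1
```

The point of the indirection is that clients never need to know cluster topology; they only ever ask the Router, which only ever asks the DCS.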
The architecture of this was a bit more complicated:
This is a simplified view of how everything worked, but we color-coded the diagram to explain what’s going on:
- PGD nodes and the traffic between them are indicated in orange.
- The DCS (etcd) and its internal traffic are shown in purple.
- HARP Manager and its communication with PGD are blue.
- HARP Router and its traffic to PGD nodes are illustrated in green.
- Traffic to and from the DCS is denoted in red.
The HARP Manager service continuously updates the status of each node in the DCS, and HARP Router obtains the status of the appropriate Lead node from the DCS. Simple. We used three nodes in this particular example because that’s the minimum necessary to maintain a voting quorum.
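Why three? A consensus-based DCS needs a strict majority of its members to agree before answering. The arithmetic is simple enough to sketch (this is plain majority math, not etcd’s actual implementation):

```python
def quorum(members):
    """Minimum number of votes for a strict majority."""
    return members // 2 + 1

def has_quorum(members, reachable):
    """Can a group of `members` nodes still decide with only
    `reachable` of them available?"""
    return reachable >= quorum(members)

# A 3-node group tolerates one failure; a 2-node group tolerates none,
# which is why three is the practical minimum.
print(quorum(3), has_quorum(3, 2))  # 2 True
print(quorum(2), has_quorum(2, 1))  # 2 False
```

Two nodes would actually be worse than one for availability: losing either node freezes the store, since a single survivor can never form a majority of two.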
It’s also convenient for service distribution, since we can place one of each necessary component on each node. It’s not uncommon to split these components into designated roles. The DCS, for example, may be assigned to dedicated nodes so the service can be shared among several clusters. The HARP Router services can likewise be decoupled from their respective nodes, allowing for better conceptual abstraction or integration into container-driven environments.
The best part about this design is that it works extremely well. Since we control both the Router and Manager components, we can ensure proper failover based on replication lag, the state of the Commit At Most Once (CAMO) queue, or any number of other requirements to prevent conflicts. The primary concerns with this architecture lie in its moderate complexity and its dependence on so many moving parts, some of which are external to EDB.
Moving to PGD Proxy
So how could EDB address both the complexity of the HARP system and its reliance on so many components? Integration! We discussed our new approach to HA routing in more depth in a blog article around the PGD 5.0 release date, but how did we get here?
Raft is one of the better-known consensus algorithms currently in use across various infrastructure stacks. It’s the system used by etcd, which is in turn integrated into Kubernetes, and it’s the backbone of popular HashiCorp tools such as Consul, Nomad, and Vault. So it should be no surprise that Raft also serves as the consensus model for PGD.
Early iterations of HARP depended on etcd because PGD did not expose its Raft implementation through API calls. Eventually we added Raft API calls to PGD so HARP could use PGD itself as its DCS instead of etcd, which removed one point of contention. PGD 5.0 also fully integrated the duties formerly held by HARP Manager. Why have an external process obtain node status and separately update the DCS, when the node and the DCS are one and the same?
Thus the new design looks more like this:
Compared to the HARP approach, we’ve purged the purple DCS traffic entirely, and the red DCS communication is more direct. PGD Proxy subscribes to PGD status changes through a standard Postgres LISTEN / NOTIFY channel, and PGD updates proxy-related keys as part of its standard operating protocols. Using LISTEN and NOTIFY is important, as it prevents excessive polling, which would otherwise add traffic to the Raft layer and potentially disrupt PGD node management.
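The shape of that subscription looks roughly like the push-based pattern below. This is a pure-Python simulation with made-up channel and key names; the real PGD Proxy listens on a Postgres LISTEN/NOTIFY channel over a normal client connection. The essential property is the same: subscribers are woken only when something changes, so nothing sits in a polling loop hammering the Raft-backed store.

```python
# Minimal push-based pub/sub, simulating LISTEN/NOTIFY semantics.
class KVStore:
    def __init__(self):
        self.data = {}
        self.listeners = {}  # channel -> list of callbacks

    def listen(self, channel, callback):
        """LISTEN: register interest in a channel."""
        self.listeners.setdefault(channel, []).append(callback)

    def put(self, channel, key, value):
        """Store the value, then NOTIFY every listener on the channel."""
        self.data[key] = value
        for cb in self.listeners.get(channel, []):
            cb(key, value)

routes = {}
store = KVStore()

# The proxy's side: cache the write leader whenever it changes.
store.listen("proxy_events", lambda k, v: routes.update({k: v}))

# The cluster's side: publish a leader change as part of normal operation.
store.put("proxy_events", "group1/write_leader", "node2")
print(routes["group1/write_leader"])  # node2
```

Between notifications the proxy does no consensus work at all; the routing decision is a local dictionary lookup against the last pushed state.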
The final improvement related to connection routing derives from Raft itself. Despite PGD making its API available to HARP, clients often chose to use etcd as the consensus layer anyway. Why might that be?
Our most commonly deployed architecture designates two PGD nodes in a Lead / Shadow arrangement per Location. Whether this location is an Availability Zone, Datacenter, Region, or otherwise doesn’t matter. What’s important is that each Location can have its own etcd cluster, meaning HARP Router will still function even if half of the PGD cluster itself is unavailable due to a Network Partition. If PGD is the DCS, losing half of the cluster means the other half will no longer be routable, as the DCS will refuse to answer queries without a quorum.
That is no longer the case with PGD 5.0, as we’ve added the capability to create Raft sub-groups. As a result, we can construct a cluster like this:
This isn’t the best cluster design, since it uses an even number of nodes, but it illustrates the point. In this scenario, the PGD Proxy in AZ1 would continue to route traffic to the Lead node there even if AZ2 vanished. The Raft sub-group ensures a Local consensus is possible even when the Global consensus is unknown. Being independent of Global consensus also removes remote consensus latency, allowing for faster failover within the sub-group (lower RTO) than before.
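The effect of sub-groups can be shown numerically. Assuming the four-node, two-AZ layout described above (node names here are invented for illustration), losing AZ2 leaves the global Raft group without a majority, while AZ1’s two-member sub-group still has every one of its own members:

```python
def has_quorum(members, reachable):
    """Strict-majority check of the kind Raft groups rely on."""
    return len(reachable & members) > len(members) / 2

# Hypothetical 4-node cluster: two nodes per availability zone,
# with a Raft sub-group per zone.
global_group = {"az1-a", "az1-b", "az2-a", "az2-b"}
az1_subgroup = {"az1-a", "az1-b"}

# AZ2 vanishes in a network partition; only AZ1 remains reachable.
reachable = {"az1-a", "az1-b"}

print(has_quorum(global_group, reachable))  # False: global consensus lost
print(has_quorum(az1_subgroup, reachable))  # True: AZ1 can still elect a local Lead
```

This is exactly the property etcd-per-Location used to provide: routing decisions survive the loss of half the cluster because the deciding group is local.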
It may not seem groundbreaking, but combined with DCS change notifications, it allows us to entirely abandon our previous reliance on etcd for certain functionality. Now there is nothing etcd can do that PGD itself can’t accomplish, all without the extra (and sometimes substantial) configuration and maintenance etcd itself requires.
An Integrated Stack
HARP started out as an innovative concept for Postgres HA, adapted to a context suitable for PGD. PGD 5.0 takes that a step further by fully embracing the concept and making DCS state part of the extension itself, exactly as the original roadmap intended. The goal has always been to transform Postgres into a cluster-aware database engine, rather than a database engine that can be clustered.
We now have a properly decoupled design in EDB Postgres Distributed 5.0: a database cluster, and something to connect to it. This is similar to how Oracle works with its Listener service, with one important enhancement even Oracle never considered: consensus. Why use some obscure configuration syntax to choose the correct node when the nodes themselves can simply inform the Proxy where traffic should go?
We’ve crossed the Rubicon here, and we’re only getting started.