Consensus layer considerations

HARP is designed to work with different implementations of the consensus layer, also known as Distributed Control Systems (DCS).

Currently the following DCS implementations are supported:

  • etcd
  • BDR

This information is specific to HARP's interaction with the supported DCS implementations.

BDR driver compatibility

The bdr native consensus layer is available from BDR versions 3.6.21 and 3.7.3.

For the purpose of maintaining a voting quorum, BDR Logical Standby nodes and BDR Subscriber-Only nodes don't participate in consensus communications in an EDB Postgres Distributed cluster. Don't count these nodes in the total node list when fulfilling DCS quorum requirements.
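
For example, a cluster with three full BDR data nodes and two Logical Standby nodes has a consensus group of three nodes, not five: at least two of the three data nodes must remain available to satisfy quorum.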

Maintaining quorum

Clusters of any architecture require a voting quorum of at least n/2 + 1 nodes (rounded down) to maintain consensus. Thus a three-node cluster can tolerate the outage of a single node, a five-node cluster can tolerate a two-node outage, and so on. If consensus is ever lost, HARP becomes inoperable because the DCS can no longer deterministically identify the lead master node in a particular location.
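
In concrete terms, the voting-majority requirement works out as follows:

  • 3 DCS nodes: quorum of 2, tolerates 1 node outage
  • 5 DCS nodes: quorum of 3, tolerates 2 node outages
  • 7 DCS nodes: quorum of 4, tolerates 3 node outages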

As a result, whichever DCS is chosen, more than half of the nodes must always be available cluster-wide. This becomes a nontrivial consideration when distributing DCS nodes among two or more data centers. A network partition prevents quorum in any location that can't maintain a voting majority, and HARP stops working there.

Thus an odd number of nodes (with a minimum of three) is crucial when building the consensus layer. An ideal case distributes nodes across a minimum of three independent locations to prevent a single network partition from disrupting consensus.

One example configuration is to designate two DCS nodes in each of two data centers, coinciding with the primary BDR nodes, and a fifth DCS node (such as a BDR witness) elsewhere. With such a design, a network partition between the two BDR data centers doesn't disrupt consensus, thanks to the independently located fifth node.
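
A sketch of the corresponding dcs section in config.yml follows; the dcs-b* and dcs-c1 hostnames are placeholders for this example, not names from an actual deployment:

dcs:
  driver: etcd
  endpoints:
    - dcs-a1:2379   # data center A
    - dcs-a2:2379   # data center A
    - dcs-b1:2379   # data center B
    - dcs-b2:2379   # data center B
    - dcs-c1:2379   # third location, such as alongside a BDR witness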

Multi-consensus variant

HARP assumes one lead master per configured location. Normally each location is specified in HARP using the location configuration setting. By creating a separate DCS cluster per location, you can emulate this behavior independently of HARP.

To accomplish this, configure HARP in config.yml to use a different DCS connection target per desired location.

HARP nodes in DC-A use something like this:

location: dca
dcs:
  driver: etcd
  endpoints:
    - dcs-a1:2379
    - dcs-a2:2379
    - dcs-a3:2379

HARP nodes in DC-B use different hostnames corresponding to DCS nodes in their canonical location:

location: dcb
dcs:
  driver: etcd
  endpoints:
    - dcs-b1:2379
    - dcs-b2:2379
    - dcs-b3:2379

There's no DCS communication between different data centers in this design, and thus a network partition between them doesn't affect HARP operation. A consequence of this is that HARP is completely unaware of nodes in the other location, and each location operates essentially as a separate HARP cluster.

This isn't possible when using BDR as the DCS, as BDR maintains a consensus layer across all participant nodes.

A possible drawback to this approach is that harpctl can't interact with nodes outside of the current location. It's impossible to obtain node information, get or set the lead master, or perform any other operation that targets the other location. Essentially this organization renders the --location parameter to harpctl unusable.

TPAexec and consensus

These considerations are integrated into TPAexec as well. When deploying a cluster with etcd, TPAexec constructs a separate DCS cluster per location, favoring high availability over strict cross-location consistency.

Thus this configuration example groups any DCS nodes assigned to the first location together, and the second location is a separate cluster:

cluster_vars:
  failover_manager: harp
  harp_consensus_protocol: etcd

locations:
  - Name: first
  - Name: second

To override this behavior, configure the harp_location setting explicitly to force a particular grouping.

Thus this example places all etcd nodes in a single cohesive DCS layer:

cluster_vars:
  failover_manager: harp
  harp_consensus_protocol: etcd

locations:
  - Name: first
  - Name: second
  - Name: all_dcs

instance_defaults:
  vars:
    harp_location: all_dcs

The harp_location override might also be necessary to enforce specific node groupings when using cloud providers such as Amazon, where regions are divided into availability zones rather than traditional data centers.
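
For example, when instances are spread across availability zones instead of data centers, the location names can follow the zones while a shared harp_location still collects every etcd node into one DCS layer. The zone names az-a, az-b, and az-c are placeholders:

cluster_vars:
  failover_manager: harp
  harp_consensus_protocol: etcd

locations:
  - Name: az-a
  - Name: az-b
  - Name: az-c
  - Name: all_dcs

instance_defaults:
  vars:
    harp_location: all_dcs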

Leadership consensus storage

Leader elections

The process of selecting a leader begins with the startup of the first HARP manager:

  • Starts Postgres on the node.
  • Reads the Postgres status, including any data on replication lag.
  • Queries the DCS for the current leader. If no leader is returned, this node can attempt to become the leader.
  • Stores the leadership request in the DCS.
  • If the leadership request isn't rejected, stores the leader lease in the DCS.
  • Refreshes the leader lease at a configurable interval, 500 ms by default (see the configuration sketch after this list).
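
The lease timing is configurable in the HARP config.yml alongside the DCS connection settings. The following is only a sketch: the setting names leader_lease_duration and lease_refresh_interval, their placement, and their units are assumptions for illustration, so confirm them against the HARP configuration reference for your version.

location: dca
dcs:
  driver: etcd
  endpoints:
    - dcs-a1:2379
# Assumed settings: lease length in seconds, refresh cadence in milliseconds.
leader_lease_duration: 6
lease_refresh_interval: 500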

Once established, subsequent HARP manager instances start up in a similar fashion:

  • Starts Postgres on the node.
  • Reads the Postgres status, including any data on replication lag.
  • Queries the DCS for the current leader. If a leader is returned, this node becomes a follower.
  • As there's a leader, the manager assumes a shadow role and monitors the leader lease.
  • If the lease expires, follows the path of the first HARP manager instance to claim leadership.

When HARP Proxy instances start, they configure themselves with this logic:

  • Subscribe to and watch for key changes in the DCS.
  • If the leader changes, reconfigure routing.

The following sequence diagram shows this order of events:

[Sequence diagram]

Detection and recovery

When availability zone (AZ) partitions occur, the process detailed in Leader elections leads to the following behaviors:

  • Partition of all AZs
    • Leader lease expires.
    • No leader can be elected, and routing stops.
  • Partition of an AZ not containing the write leader
    • Normal operations continue as long as the write leader has a majority.
  • Recovery
    • If no leader exists:
      • Wait for consensus to be reestablished.
      • Elect a new leader.
      • Start routing.
    • Replication replays outstanding transactions kept in replication slots.