Validating Raft consistency in EDB Postgres Distributed

April 12, 2024

Introduction

EDB Postgres Distributed (PGD) provides best-in-class high availability required for always-on, enterprise-grade Postgres database application performance across fully managed or self-managed deployments in any cloud. With EDB’s Postgres technology supporting robust, globally distributed applications that process thousands of transactions per second, customers regard EDB Postgres Distributed as their go-to for delivering up to 99.999% guaranteed uptime and improving the performance of data replication by up to 5X.

As part of internal activities leading up to our latest EDB Postgres Distributed release, our Engineering team focused on how to deliver more robust releases that enhance customer confidence in PGD. Setting correct expectations for customers prompted us to enhance our internal testing processes, in an effort influenced by the Jepsen database testing framework.

Validating EDB Postgres Distributed safety, correctness, and consistency

In the Jepsen framework, a Jepsen control node hosts the Jepsen libraries, together with the PGD plugins and tests. The control node talks to the PGD cluster as shown below.

This blog is the first in a series of posts as EDB extends its commitment to validating PGD’s availability and consistency in geo-distributed Postgres data architectures.

In the subsections that follow, we’ll walk through our internal testing and how it validates EDB Postgres Distributed’s availability and consistency in active/active architectures.

Background and Terminology

Before describing our internal PGD testing enhancements, let’s provide some background and terminology references. 

Operation: An operation (e.g., read or write) is carried out by a client. This operation can take some time to complete. A modifying operation such as a write can take effect sometime between its invocation and completion. Multiple operations can be in progress simultaneously.

Process: An operation is performed by a logical process. A process can do only one operation at a time.

History: A history is an ordered sequence of operation invocations and completions, with their timestamps. An example is provided below.

{:process 0, :type :invoke, :f :read, :value nil}: invocation by process 0

{:process 1, :type :invoke, :f :write, :value 3}: invocation by process 1

{:process 1, :type :info, :f :write, :value 3}: completion of process 1 (result uncertain)

{:process 0, :type :ok, :f :read, :value 3}: completion of process 0 (success)
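
For illustration, the same history can be written as plain data. The tests themselves are driven by Jepsen, which is written in Clojure; the minimal Python sketch below simply mirrors the entries above:

# The same history as a list of dicts. "invoke" marks the start of an operation,
# "ok" a confirmed completion, and "info" a completion whose outcome is uncertain.
history = [
    {"process": 0, "type": "invoke", "f": "read",  "value": None},
    {"process": 1, "type": "invoke", "f": "write", "value": 3},
    {"process": 1, "type": "info",   "f": "write", "value": 3},
    {"process": 0, "type": "ok",     "f": "read",  "value": 3},
]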

Testing Linearizability

We used the Knossos checker to verify that a history generated from a set of concurrent operations is linearizable. Knossos models a register that supports the operations read, write, and compare-and-swap (CAS). Given a history of concurrent operations, if there is at least one ordering of those operations that is consistent with this sequential register, the system is validated as linearizable for that history.
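
Knossos itself is written in Clojure and uses a far smarter search than what follows; the minimal Python sketch below is only meant to convey the idea (and it ignores indeterminate :info results): try every ordering of the operations that respects real time and check it against a sequential register.

from itertools import permutations

# Each operation is (name, f, arg, result, invoke_time, complete_time).
# A history is linearizable if some total order of its operations
# (1) respects real time -- if A completed before B was invoked, A comes first --
# and (2) matches a sequential register supporting read, write, and CAS.

def respects_real_time(order, ops):
    pos = {op[0]: i for i, op in enumerate(order)}
    return all(
        not (a[5] < b[4] and pos[a[0]] > pos[b[0]])  # a finished before b began
        for a in ops for b in ops
    )

def matches_sequential_register(order):
    value = None
    for _name, f, arg, result, _inv, _comp in order:
        if f == "write":
            value = arg
        elif f == "read" and result != value:
            return False
        elif f == "cas":
            old, new = arg
            ok = value == old
            if ok:
                value = new
            if result != ok:
                return False
    return True

def linearizable(ops):
    # Exhaustive search: fine for tiny histories, exponential in general,
    # which is one reason these tests must keep histories short.
    return any(
        respects_real_time(order, ops) and matches_sequential_register(order)
        for order in permutations(ops)
    )

# Stale read: the write of 4 completed (t=2..3) before the read began (t=4),
# yet the read returned the older value 2, so no valid ordering exists.
ops = [
    ("w2", "write", 2, None, 0, 1),
    ("w4", "write", 4, None, 2, 3),
    ("rd", "read", None, 2, 4, 5),
]
print(linearizable(ops))  # False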

Linearizable System

A linearizable implementation implies that the following properties are met:

  1. The system should support these operations (a minimal sketch of their semantics follows this list):
    1. read(key)
    2. write(key, value)
    3. compare_and_swap(key, old_value, new_value)
  2. In the presence of multiple concurrent reads and writes from multiple nodes, a read should never return a stale value. A stale read is one that returns an older value V even though at least one successful modification that set the key to a newer value V2 completed before the read was invoked.
  3. As the system moves from one state to another, the state observed by clients should always progress forward. Returning a stale value is like going backward.
  4. There should be a total order on all the operations.
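
To pin down what those three operations mean, here is a minimal single-node sketch in Python. It only fixes the expected sequential semantics; the point of linearizability is that a distributed implementation must present the same observable behavior to clients on every node.

# A minimal key-value register with read, write, and compare-and-swap.
class Register:
    def __init__(self):
        self.store = {}

    def read(self, key):
        return self.store.get(key)

    def write(self, key, value):
        self.store[key] = value

    def compare_and_swap(self, key, old_value, new_value):
        # Atomically set key to new_value only if its current value is old_value.
        if self.store.get(key) == old_value:
            self.store[key] = new_value
            return True
        return False

r = Register()
r.write("A", 2)
print(r.compare_and_swap("A", 2, 4))  # True: the value was 2, now 4
print(r.compare_and_swap("A", 2, 8))  # False: the value is 4, not 2
print(r.read("A"))                    # 4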

In a distributed system, an operation is applied to the system at a certain time. It may take some time to complete, and if it is a modifying operation, the modification takes effect sometime between its start time and its completion time. The diagrams that follow show how operations proceed concurrently on a distributed system.

The first diagram below provides an example of non-linearizable behavior. After a write has been applied and completed, a subsequent read returns an older value. A linearizable system cannot go “backward”.

The second diagram below shows the behavior of a linearizable system. There is a total order on the operations: the system moves from a state of A = 2 to A = 4 to A = 8, and so on, and clients see the system moving forward. An operation takes some time to take effect, but once it does, every node sees the effect of that operation. Thus, the distributed system behaves as a single system.

A look at EDB’s implementation of Raft in PGD

PGD uses Raft as a consensus protocol for certain key operations. While Raft is not in the data path of transactions, it is used for configuration management, electing a write leader for a subgroup, serializing DDL and global locks, and arriving at consensus in distributed transaction reconciliation.

One of our first steps was to test the linearizability of PGD’s Raft implementation. We used a key-value API provided by the Raft subsystem to test linearizable behavior. The implementation of the read, write, and CAS calls goes through Raft.
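
The tests themselves are written in Clojure with Jepsen; purely for illustration, a Python client along the following lines could drive such an API. The SQL function names here (demo_raft_kv_read and friends) are placeholders made up for this sketch, not the actual PGD functions.

# Illustration only: demo_raft_kv_* are placeholder names, not the real PGD API.
import psycopg2

conn = psycopg2.connect("host=pgd-node-1 dbname=bdrdb user=postgres")
conn.autocommit = True

def kv_read(key):
    with conn.cursor() as cur:
        cur.execute("SELECT demo_raft_kv_read(%s)", (key,))
        return cur.fetchone()[0]

def kv_write(key, value):
    with conn.cursor() as cur:
        cur.execute("SELECT demo_raft_kv_write(%s, %s)", (key, value))

def kv_cas(key, old_value, new_value):
    with conn.cursor() as cur:
        cur.execute("SELECT demo_raft_kv_cas(%s, %s, %s)", (key, old_value, new_value))
        return cur.fetchone()[0]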

In the timeline charts below, a failed operation appears in pink, and a successful operation appears in blue. An operation may also return with uncertainty, which means it may or may not have taken effect; this is factored in by the Knossos checker.

Using this API to test linearizability yields timeline charts provided in the sample exhibits that follow.

Example of Correct Execution

Example of Invalid Analysis

Linearizable Reads and Raft

The tests showed some processes reading stale values, as seen in the example above. This is a bug, or at least a linearizability violation. The reason is that the key-value Raft implementation in PGD reads from the leader. The leader should have the latest value, because it acknowledges a request only after getting an acknowledgement from the majority of replicas that they have committed the change, and then applying the changed state locally. But when there is a leader change, the new leader must have the change committed in its Raft logs, yet it may not have applied that change to the key-value store. For the new leader, the apply_index may lag the commit_index, and a read may return a stale value.

PGD code that uses Raft for various use cases, such as configuration management, already waits when it needs the applied value to be visible locally. We created an explicit SQL call to wait until the apply index catches up with the commit index, and when the tests issued this call before reading, they succeeded.
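
As a sketch of the resulting read path (again with made-up names: demo_raft_wait_for_apply stands in for the new SQL call, and demo_raft_kv_read for the key-value read), the tests wait for the apply index to catch up before reading:

# Illustration only: both function names are placeholders for the calls described above.
def kv_linearizable_read(conn, key):
    with conn.cursor() as cur:
        # Block until the local apply index has caught up with the commit index,
        # closing the stale-read window that opens right after a leader change.
        cur.execute("SELECT demo_raft_wait_for_apply()")
        cur.execute("SELECT demo_raft_kv_read(%s)", (key,))
        return cur.fetchone()[0]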

Lessons Learned

During this process, we learned some valuable lessons.

  1. An API definition needs to be very precise and allow callers to infer whether something has definitely failed or whether there is uncertainty. For example, throwing errors with the right details, such as indicating that a request failed because a timeout expired, allows the caller to infer that there is uncertainty in the result of the request: it may or may not have succeeded. For tests to accurately check results, requests need to return :ok (success), :fail (failure), or :info (uncertainty). An :info result means the request could have either succeeded or failed. Wrongly considering a request as failed when it could have succeeded gives false alarms in the tests (a sketch of this classification follows this list).
  2. Some of the tests that use the register workload with the Knossos checker have very high space and time complexity. If a test runs with a small number of keys, verification can take a very long time and a large amount of memory, so the test duration needs to be short. If the duration is long, the verification process can run out of memory; if the duration is kept short, bugs may not get caught. Likewise, if the number of keys is increased, the bug may not be hit in the test. Similarly, creating a lot of contention can result in the code hitting deadlocks and slowing down the progress of the test. We are working to strike a balance among these factors to make sure the code gets tested better.
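
The real tests are written in Clojure with Jepsen; the small Python sketch below only illustrates the classification described in the first lesson, so that an uncertain outcome is never recorded as a definite failure:

class DefinitelyFailed(Exception):
    """Raised when the server definitely rejected the request before it took effect."""

def classify(run_request):
    # Map the outcome of one request to the three result types the checker expects.
    try:
        run_request()
        return ":ok"        # definitely succeeded
    except DefinitelyFailed:
        return ":fail"      # definitely did not take effect
    except (TimeoutError, ConnectionError):
        return ":info"      # uncertain: it may or may not have taken effect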

Status and work-in-progress activities

As a result of writing and running these tests, we were able to target a few areas of next-stage focus, including: 

  1. Helping identify some rare cases where update conflicts, in the presence of write skew, cause data divergence among nodes.
  2. Improving PGD error handling on replicas in the presence of conflicts.
  3. Helping identify gaps in eager and async conflict resolution.

We continue to refine our efforts to validate PGD’s availability and consistency in geo-distributed architectures with various configurations, and we’ll share those findings in future blogs, focusing on:

  1. Conflict resolution with CRDTs
  2. Qualifying the write leader with a majority commit scope for strict serializability
  3. Qualifying the write leader with synchronous commit for linearizability
  4. Adding tests for correctness of global locks

For more information about the latest EDB Postgres Distributed release, check out our EDB Docs.
