Group Commit v5

Commit scope kind: GROUP COMMIT

Overview

The goal of Group Commit is to protect against data loss in case of single node failures or temporary outages. You achieve this by requiring more than one PGD node to successfully confirm a transaction at COMMIT time. Confirmation can be sent at a number of points in the transaction processing, but defaults to "visible" when the transaction has been flushed to disk and is visible to all other transactions.

Example

SELECT bdr.add_commit_scope(
    commit_scope_name := 'example_scope',
    origin_node_group := 'left_dc',
    rule := 'ALL (left_dc) GROUP COMMIT(commit_decision=raft) AND ANY 1 (right_dc) GROUP COMMIT',
    wait_for_ready := true
);

This example creates a commit scope where all the nodes in the left_dc group and any one of the nodes in the right_dc group must receive and successfuly confirm a committed transaction.

Requirements

During normal operation, Group Commit is transparent to the application. Transactions that were in progress during failover need the reconciliation phase triggered or consolidated by either the application or a proxy in between. This activity currently happens only when either the origin node recovers or when it's parted from the cluster. This behavior is the same as with Postgres legacy built-in synchronous replication.

Transactions committed with Group Commit use two-phase commit underneath. Therefore, configure max_prepared_transactions high enough to handle all such transactions originating per node.

Limitations

See the Group Commit section of the Limitations section.

Configuration

To use Group Commit, first define a commit scope. The commit scope determines the PGD nodes involved in the commit of a transaction.

Confirmation

Confirmation LevelGroup Commit Handling
receivedA remote PGD node confirms the transaction immediately after receiving it, prior to starting the local application.
replicatedConfirms after applying changes of the transaction but before flushing them to disk.
durableConfirms the transaction after all of its changes are flushed to disk.
visible (default)Confirms the transaction after all of its changes are flushed to disk and it's visible to concurrent transactions.

Behavior

The behavior of Group Commit depends on the configuration applied by the commit scope.

Commit decisions

You can configure Group Commit to decide commits in three different ways: group, partner, and raft.

The group decision is the default. It specifies that the commit is confirmed by the origin node upon it recieving as many confirmations as required by the commit scope group. The difference is that the commit decision is made based on PREPARE replication while the durability checks COMMIT (PREPARED) replication.

The partner decision is what Commit At Most Once uses. This approach works only when there are two data nodes in the node group. These two nodes are partners of each other, and the replica rather than origin decides whether to commit something. This approach requires application changes to use the CAMO transaction protocol to work correctly, as the application is in some way part of the consensus. For more on this approach, see CAMO.

The raft decision uses PGDs built-in raft consensus for commit decisions. Use of the raft decision can reduce performance. It is currently only required when using GROUP COMMIT with an ALL commit scope group.

Using an ALL commit scope group requires that the commit decision must be set to raft to avoid reconciliation issues.

Conflict resolution

Conflict resolution can be async or eager.

Async means that PGD does optimistic conflict resolution during replication using the row-level resolution as configured for given node. This happens regardless of whether the origin transaction committed or is still in progress. See Conflicts for details about how the asynchronous conflict resolution works.

Eager means that conflicts are resolved eagerly (as part of agreement on COMMIT), and conflicting transactions get aborted with a serialization error. This approach provides greater isolation than the asynchronous resolution at the price of performance.

Using an ALL commit scope group requires that the commit decision must be set to raft to avoid reconciliation issues.

For the details about how Eager conflict resolution works, see Eager conflict resolution.

Aborts

To prevent a transaction that can't get consensus on the COMMIT from hanging forever, the ABORT ON clause allows specifying timeout. After the timeout, the transaction abort is requested. If the transaction is already decided to be committed at the time the abort request is sent, the transaction does eventually COMMIT even though the client might receive an abort message.

See also Limitations.

Transaction Reconciliation

A group commit transaction’s commit on the origin node is implicitly converted into a two phase commit.

In the first phase (prepare) the transaction is prepared locally and made ready to commit. The data is made durable but uncomitted at this stage so other transactions cannot see the changes made by this transaction. This prepared transaction gets copied to all remaining nodes through normal logical replication.

The origin node seeks confirmations from other nodes, as per rules in the group commit grammar. If it gets confirmations from minimum required nodes in the cluster, it decides to commit this transaction moving onto the second phase (commit) where it also sends this decision via replication to other nodes which will also eventually commit on getting this message.

There is a possibility of failure at various stages. For example, the origin node may crash after preparing the transaction. Or the origin and one or more replicas may crash.

This leaves the prepared transactions in the system. The pg_prepared_xacts view in Postgres can show prepared transactions on a system. The prepared transactions could be holding locks and other resources and they therefore need to be either aborted or committed. That decision has to be made with a consensus of nodes.

When commit_decision is raft then, raft acts as the reconcilator and these transactions are eventually, automatically, reconciled.

When the commit_decision is group then, transactions do not use raft. Instead the write lead in the cluster performs the role of reconciliator. This is because it is the node that is most ahead with respect to changes in its sub-group. It detects when a node is down and initiates reconciliation for such node by looking for prepared transactions it may have with the down node as the origin.

For all such transactions, it sees if the nodes as per the rules of the commit scope have the prepared transaction, it takes a decision. This decision is conveyed over raft and needs majority of the nodes to be up to do reconciliation.

This process happens in the background and there is no user command needed to control or issue this.