EDB Docs - EDB Postgres® AI for CloudNativePG™ Cluster v1.30.0

In the case of unexpected errors on the primary for longer than the .spec.failoverDelay (by default 0 seconds), the cluster will go into failover mode. This may happen, for example, when:

The primary pod has a disk failure
The primary pod is deleted
The postgres container on the primary has any kind of sustained failure

In the failover scenario, the primary cannot be assumed to be working properly.

After cases like the ones above, the readiness probe for the primary pod will start failing. This will be picked up in the controller's reconciliation loop. The controller will initiate the failover process, in two steps:

First, it will mark the TargetPrimary as pending. This change of state will force the primary pod to shut down, to ensure the WAL receivers on the replicas will stop. The cluster will be marked in failover phase ("Failing over").
Once all WAL receivers are stopped, there will be a leader election, and a new primary will be named. The chosen instance will initiate promotion to primary, and, after this is completed, the cluster will resume normal operations. Meanwhile, the former primary pod will restart, detect that it is no longer the primary, and become a replica node.

Important

The two-phase procedure helps ensure the WAL receivers can stop in an orderly fashion, and that the failing primary will not start streaming WALs again upon restart. These safeguards prevent timeline discrepancies between the new primary and the replicas.

During the time the failing primary is being shut down:

It will first try a PostgreSQL fast shutdown, with .spec.switchoverDelay seconds as the timeout. This graceful shutdown will attempt to archive pending WALs.
If the fast shutdown fails, or its timeout is exceeded, a PostgreSQL immediate shutdown is initiated.
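The escalation above can be sketched as a small decision function. This is an illustrative model only, not the instance manager's code: the function name and its inputs are hypothetical, and `switchover_delay` stands for the value of .spec.switchoverDelay.

```python
from typing import Optional


def shutdown_mode_used(fast_shutdown_seconds: Optional[float],
                       switchover_delay: float) -> str:
    """Return the shutdown mode that ends up stopping the failing primary.

    fast_shutdown_seconds: time the fast shutdown would need to complete,
        or None if it fails outright (hypothetical input for this sketch).
    switchover_delay: timeout in seconds, i.e. .spec.switchoverDelay.
    """
    if fast_shutdown_seconds is not None and fast_shutdown_seconds <= switchover_delay:
        return "fast"
    # Fast shutdown failed or exceeded its timeout: fall back to immediate.
    return "immediate"
```

A fast shutdown that would need 45 seconds against a 30-second timeout ends in an immediate shutdown, which is the case where pending WAL may not be archived.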

Info

"Fast" mode does not wait for PostgreSQL clients to disconnect and will terminate an online backup in progress. All active transactions are rolled back and clients are forcibly disconnected, then the server is shut down. "Immediate" mode will abort all PostgreSQL server processes immediately, without a clean shutdown.

Safe primary election

To ensure that at most one instance promotes itself to primary at any given time, EDB Postgres® AI for CloudNativePG™ Cluster backs the election described above with a Kubernetes Lease object, named after the Cluster and living in its namespace. An instance must acquire and hold this lease before it promotes to primary. On a clean shutdown the former primary releases the lease, so an eligible replica can take over without waiting for the lease to expire.

What the primary lease protects against

Consider a replica catching up by reading WAL files from the archive rather than streaming from the previous primary. PostgreSQL stops replaying as soon as the archive returns "file not found" for the next expected segment, treating it as the end of the WAL stream. If a replica promotes while the previous primary still has WAL it has not finished archiving, that signal arrives before the true end of the stream: the replica forks a new timeline at an LSN earlier than the last writes the previous primary acknowledged, and those writes are lost.

The lease holds promotion back until the previous primary releases it. On a clean shutdown the release happens after PostgreSQL has flushed and archived its remaining WAL, so the replica that takes over sees the archive at its definitive end. If the previous primary cannot release the lease (crash, node failure, or the instance manager itself unreachable), the lease expires after leaseDurationSeconds elapses and the replica can promote. In that path the archive may not have caught up, and any writes the previous primary did not finish archiving are lost.

Relationship with the primary isolation check

The lease does not fence a primary that has lost connectivity to the Kubernetes API server but is still running; that is the job of the primary isolation check. The two mechanisms are complementary and both are enabled by default:

The lease prevents premature promotion: a replica cannot promote while the former primary still holds the lease.
The isolation check stops an isolated primary from continuing to accept writes.

Keep both enabled. Disabling the isolation check leaves the lease alone responsible for primary safety, and the lease alone cannot prevent split-brain when the former primary cannot reach the API server but remains otherwise healthy.

Inspecting the primary lease

The lease shares the cluster's name, so you can inspect it directly. For a cluster called cluster-example:

kubectl get lease cluster-example

The HOLDER column reports the pod that currently holds the lease, which is the current primary:

NAME              HOLDER              AGE
cluster-example   cluster-example-1   5m

For the full picture, including the lease duration in effect and the last renewal time, use:

kubectl get lease cluster-example -o yaml

Tuning the primary lease

The lease timings are exposed under .spec.primaryLease and default to values suitable for most clusters. They map directly onto the underlying Kubernetes leader-election parameters.

Field	Default	Description
`leaseDurationSeconds`	`15`	How long the lease is valid before another instance may acquire it.
`renewDeadlineSeconds`	`10`	How long the primary keeps retrying to renew the lease before giving up.
`retryPeriodSeconds`	`2`	How frequently a non-holder retries acquiring or renewing the lease.
`releasedLeaseDurationSeconds`	`1`	TTL written when the primary releases the lease on a clean shutdown.

For example, to make the cluster more tolerant of a slow or briefly unreachable API server:

spec:
  primaryLease:
    leaseDurationSeconds: 60
    renewDeadlineSeconds: 40
    retryPeriodSeconds: 15

Warning

Tune these values only if you understand the impact on failover timing: longer intervals make the cluster more tolerant of transient API server unavailability but slow down legitimate promotions. Two invariants are enforced by the admission webhook: leaseDurationSeconds must be greater than renewDeadlineSeconds, and renewDeadlineSeconds must be greater than retryPeriodSeconds multiplied by 1.2. Both mirror the requirements of the underlying Kubernetes leader election.
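The two invariants can be checked before applying a manifest. The following is a minimal sketch that mirrors only the rules stated in the warning; the function name is hypothetical and the admission webhook remains the authoritative check.

```python
from typing import List

# Multiplier applied to retryPeriodSeconds, as stated in the warning above.
JITTER_FACTOR = 1.2


def primary_lease_errors(lease_duration: int,
                         renew_deadline: int,
                         retry_period: int) -> List[str]:
    """Return the violated invariants for a .spec.primaryLease stanza."""
    errors = []
    if lease_duration <= renew_deadline:
        errors.append(
            "leaseDurationSeconds must be greater than renewDeadlineSeconds")
    if renew_deadline <= retry_period * JITTER_FACTOR:
        errors.append(
            "renewDeadlineSeconds must be greater than retryPeriodSeconds * 1.2")
    return errors
```

Both the defaults (15, 10, 2) and the example above (60, 40, 15) pass: 40 is greater than 15 × 1.2 = 18, and 60 is greater than 40.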

Note

leaseDurationSeconds and retryPeriodSeconds govern two different timings. After an abrupt primary loss (the previous primary did not release the lease), a candidate must observe the lease unchanged for a full leaseDurationSeconds before it may take over: this is what holds back a premature promotion while the former primary may still be alive. After a clean switchover (the previous primary released the lease), there is no such wait; the candidate simply notices the released lease on its next poll, so the hand-over latency is bounded by retryPeriodSeconds. Lowering retryPeriodSeconds speeds up switchover without shortening the take-over wait that guards against premature promotion, at the cost of more frequent lease renewals against the API server.
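As a rough illustration of the two timings, the sketch below returns the wait that dominates each path. It is a simplification under stated assumptions: the function name is hypothetical, and it ignores releasedLeaseDurationSeconds, polling jitter, and API server latency.

```python
def takeover_wait_seconds(released_cleanly: bool,
                          lease_duration: int = 15,
                          retry_period: int = 2) -> int:
    """Approximate wait before a candidate may acquire the primary lease.

    Clean release: the candidate notices the released lease on its next
    poll, so the wait is bounded by retry_period.
    Abrupt loss: the candidate must observe the lease unchanged for a
    full lease_duration.
    """
    return retry_period if released_cleanly else lease_duration
```

With the defaults, lowering `retry_period` from 2 to 1 halves the switchover bound while the abrupt-loss wait stays at 15 seconds.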

Note

The primary instance captures these timings the first time it acquires the lease. Changing .spec.primaryLease on a running cluster therefore takes effect only after the affected primary Pod restarts; until then the primary keeps using the values it started with.

RTO and RPO impact

Failover may result in the service being impacted (RTO) and/or data being lost (RPO):

Between the moment the primary starts to fail and the moment the controller starts the failover procedure, queries in transit, WAL writes, checkpoints, and similar operations may fail.
Once the fast shutdown command has been issued, the cluster will no longer accept connections, so service will be impacted but no data will be lost.
If the fast shutdown fails, the immediate shutdown will stop any pending processes, including WAL writing. Data may be lost.
During the time the primary is shutting down and a new primary hasn't yet started, the cluster will operate without a primary and thus be impaired - but with no data loss.

Note

The timeout that controls fast shutdown is set by .spec.switchoverDelay, as in the case of a switchover. Increasing the time for fast shutdown is safer from an RPO point of view, but possibly delays the return to normal operation - negatively affecting RTO.

Warning

As already mentioned in the "Instance Manager" section when explaining the switchover process, the .spec.switchoverDelay option affects the RPO and RTO of your PostgreSQL database. Setting it to a low value might favor RTO over RPO, but can lead to data loss at the cluster level and/or the backup level. Conversely, setting it to a high value might remove the risk of data loss while leaving the cluster without an active primary for a longer time during the switchover.

Delayed failover

As mentioned above, the .spec.failoverDelay option allows you to delay the start of the failover procedure by a number of seconds after the primary has been detected to be unhealthy. The default value is 0, which triggers the failover procedure immediately.

Sometimes failing over to a new primary can be more disruptive than waiting for the primary to come back online. This is especially true of network disruptions where multiple tiers are affected (e.g., downstream logical subscribers) or when the time to perform the failover is longer than the expected outage.

Setting .spec.failoverDelay to a non-zero value provides a mechanism to prevent premature failover during short-lived network or node instability.

Detection of node-level failures

When the node hosting the primary becomes unreachable (for example, due to a kubelet crash or a network partition between the node and the Kubernetes API server), the operator relies on the pod's Ready condition to decide that the primary is no longer serviceable. While the node is healthy the kubelet keeps that condition up to date from the readiness probe; once the node stops reporting, the Kubernetes node lifecycle controller is the one that flips the condition to False as soon as it declares the node Unknown.

With stock kube-controller-manager settings, the transition is governed by --node-monitor-grace-period (default 40s on Kubernetes 1.29-1.31, raised to 50s in 1.32 and later): after that window the controller marks the node Unknown and, in the same monitoring pass, issues a patch per pod on that node to flip the Ready condition. In practice the operator observes the primary as unready about 40 to 55 seconds after the node becomes unreachable (the grace period plus up to one --node-monitor-period poll, default 5s). Managed Kubernetes distributions (GKE, EKS, AKS) may tune these values; consult the provider's documentation if the observed timing does not match. After that, the failover procedure starts (further gated by .spec.failoverDelay).
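The arithmetic behind the 40 to 55 second range can be written out as follows. This is a minimal sketch of the calculation described above, assuming stock kube-controller-manager defaults; the function name is hypothetical.

```python
from typing import Tuple


def unready_detection_window(grace_period: int,
                             monitor_period: int = 5) -> Tuple[int, int]:
    """Return (earliest, latest) seconds after the node becomes unreachable
    at which the operator can observe the primary as unready.

    grace_period: --node-monitor-grace-period, in seconds.
    monitor_period: --node-monitor-period, in seconds; the flip can lag by
        up to one poll.
    """
    return (grace_period, grace_period + monitor_period)
```

A 40s grace period yields 40 to 45 seconds and a 50s grace period yields 50 to 55 seconds, which together span the range quoted above. Any .spec.failoverDelay is added on top of this window.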

The Ready condition flip is not subject to the rate limiters that throttle pod eviction during partial-zonal or large-cluster disruptions (--node-eviction-rate, --secondary-node-eviction-rate, --unhealthy-zone-threshold). The operator reacts to the condition flip as soon as the controller emits the patch, regardless of the zone or cluster-wide health state.

Pod eviction (actual deletion from the unreachable node) is a separate mechanism, driven by tolerationSeconds on the node.kubernetes.io/unreachable NoExecute taint (300s by default). That timer does not hold up the operator's failover decision; EDB Postgres® AI for CloudNativePG™ Cluster promotes a new primary as soon as the Ready condition flips. By that point the kubelet on the isolated node has already stopped the old PostgreSQL container locally: with the default .spec.probes.liveness.isolationCheck.enabled: true, the instance manager fails its own liveness probe once it can reach neither the API server nor the rest of the cluster, and the kubelet kills the container within approximately three probe periods (~30s). Full high availability (recreation of the old primary on a healthy node by the operator) is still gated on the taint-based eviction actually deleting the pod.

Failover Quorum (Quorum-based Failover)

Failover quorum is a mechanism that enhances data durability and safety during failover events in EDB Postgres® AI for CloudNativePG™ Cluster-managed PostgreSQL clusters.

Quorum-based failover allows the controller to determine whether to promote a replica to primary based on the state of a quorum of replicas. This is useful when you need stronger data durability than synchronous replication and the default automated failover procedure provide on their own.

When synchronous replication is not enabled, some data loss is expected and accepted during failover, as a replica may lag behind the primary when promoted.

With synchronous replication enabled, the guarantee is that the application will not receive explicit acknowledgment of the successful commit of a transaction until the WAL data is known to be safely received by all required synchronous standbys. This is not enough to guarantee that the operator is able to promote the most advanced replica.

For example, in a three-node cluster with synchronous replication set to ANY 1 (...), data is written to the primary and one standby before a commit is acknowledged. If both the primary and the aligned standby become unavailable (such as during a network partition), the remaining replica may not have the latest data. Promoting it could lose some data that the application considered committed.

Quorum-based failover addresses this risk by ensuring that failover only occurs if the operator can confirm the presence of all synchronously committed data in the instance to promote, and it does not occur otherwise.

This feature allows users to choose their preferred trade-off between data durability and data availability.

Failover quorum can be enabled by setting the .spec.postgresql.synchronous.failoverQuorum field to true:

apiVersion: postgresql.k8s.enterprisedb.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 3

  postgresql:
    synchronous:
      method: any
      number: 1
      failoverQuorum: true

  storage:
    size: 1Gi

For backward compatibility, the legacy annotation alpha.k8s.enterprisedb.io/failoverQuorum is still supported by the admission webhook and takes precedence over the Cluster spec option:

If the annotation evaluates to "true" and a synchronous replication stanza is present, the webhook automatically sets .spec.postgresql.synchronous.failoverQuorum to true.
If the annotation evaluates to "false", the feature is always disabled.

Important

Because the annotation overrides the spec, we recommend that users of this experimental feature migrate to the native .spec.postgresql.synchronous.failoverQuorum option and remove the annotation from their manifests. The annotation is deprecated and will be removed in a future release.

How it works

Before promoting a replica to primary, the operator performs a quorum check, following the principles of the Dynamo R + W > N consistency model¹.

In the quorum failover, these values assume the following meaning:

R is the number of promotable replicas (read quorum);
W is the number of replicas that must acknowledge the write before the COMMIT is returned to the client (write quorum);
N is the total number of potentially synchronous replicas;

Promotable replicas are replicas that have these properties:

are part of the cluster;
are able to report their state to the operator;
are potentially synchronous;

If R + W > N, then we can be sure that among the promotable replicas there is at least one that has confirmed all the synchronous commits, and we can safely promote it to primary. If this is not the case, the controller will not promote any replica to primary, and will wait for the situation to change.
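The rule itself is a single comparison. The sketch below shows it with R, W, and N as defined above; the function name is hypothetical and this is not the operator's implementation.

```python
def failover_allowed(r: int, w: int, n: int) -> bool:
    """Quorum check: promotion is safe only when R + W > N.

    r: number of promotable replicas (read quorum).
    w: replicas that must acknowledge a write before COMMIT returns.
    n: total number of potentially synchronous replicas.
    """
    return r + w > n
```

For example, with W = 1 and N = 2, two promotable replicas satisfy the check (2 + 1 > 2), while a single promotable replica does not (1 + 1 = 2).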

Users can force a promotion of a replica to primary through the kubectl cnp promote command even if the quorum check is failing.

Warning

Manual promotion should only be used as a last resort. Before proceeding, make sure you fully understand the risk of data loss and carefully consider the consequences of prioritizing the resumption of write workloads for your applications.

An additional CRD is used to track the quorum state of the cluster. A Cluster with quorum failover enabled has a FailoverQuorum resource with the same name as the Cluster resource. The controller creates the FailoverQuorum CR when quorum failover is enabled, the primary instance updates it during its reconciliation loop, and the operator reads it during quorum checks. It tracks the latest known configuration of the synchronous replication.

Important

Users should not modify the FailoverQuorum resource directly. During PostgreSQL configuration changes, when it is not possible to determine the configuration, the FailoverQuorum resource will be reset, preventing any failover until the new configuration is applied.

The FailoverQuorum resource works in conjunction with PostgreSQL synchronous replication.

Warning

There is no guarantee that transactions whose COMMIT was returned to the client without being replicated synchronously, such as those that explicitly disable synchronous replication with SET synchronous_commit TO local, will be present on a promoted replica.

Quorum Failover Example Scenarios

In the following scenarios, R is the number of promotable replicas, W is the number of replicas that must acknowledge a write before commit, and N is the total number of potentially synchronous replicas. The "Failover" column indicates whether failover is allowed under quorum failover rules.

Scenario 1: Three-node cluster, failing pod(s)

A cluster with instances: 3, synchronous.number=1, and dataDurability=required.

If only the primary fails, two promotable replicas remain (R=2). Since R + W > N (2 + 1 > 2), failover is allowed and safe.
If both the primary and one replica fail, only one promotable replica remains (R=1). Since R + W = N (1 + 1 = 2), failover is not allowed to prevent possible data loss.

R	W	N	Failover
2	1	2	✅
1	1	2	❌

Scenario 2: Three-node cluster, network partition

A cluster with instances: 3, synchronous.number: 1, and dataDurability: required experiences a network partition.

If the operator can communicate with the primary, no failover occurs. The cluster can be impacted if the primary cannot reach any standby, since it won't commit transactions due to synchronous replication requirements.
If the operator cannot reach the primary but can reach both replicas (R=2), failover is allowed. If the operator can reach only one replica (R=1), failover is not allowed, as the synchronous standby may be the unreachable replica.

R	W	N	Failover
2	1	2	✅
1	1	2	❌

Scenario 3: Five-node cluster, network partition

A cluster with instances: 5, synchronous.number=2, and dataDurability=required experiences a network partition.

If the operator can communicate with the primary, no failover occurs. The cluster can be impacted if the primary cannot reach at least two standbys, since it won't commit transactions due to synchronous replication requirements.
If the operator cannot reach the primary but can reach at least three replicas (R=3), failover is allowed. If the operator can reach only two replicas (R=2), failover is not allowed, as the synchronous standbys may be the other two.

R	W	N	Failover
3	2	4	✅
2	2	4	❌

Scenario 4: Three-node cluster with remote synchronous replicas

A cluster with instances: 3 and remote synchronous replicas defined in standbyNamesPre or standbyNamesPost. We assume that the primary is failing.

This scenario requires an important consideration. Replicas listed in standbyNamesPre or standbyNamesPost are not counted in R (they cannot be promoted), but are included in N (they may have received synchronous writes). So, if synchronous.number <= len(standbyNamesPre) + len(standbyNamesPost), failover is not possible, as no local replica can be guaranteed to have the required data. The operator prevents such configurations during validation, but some invalid configurations are shown below for clarity.
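The condition in the paragraph above can be expressed as a short check. This sketch covers only that one rule, with a hypothetical function name; the operator's validation may take other fields into account.

```python
from typing import Sequence


def quorum_failover_possible(number: int,
                             standby_names_pre: Sequence[str] = (),
                             standby_names_post: Sequence[str] = ()) -> bool:
    """Return False when synchronous.number could be satisfied entirely by
    remote standbys, so no local replica is guaranteed to hold the data."""
    remote = len(standby_names_pre) + len(standby_names_post)
    return number > remote
```

With one remote standby, `number: 2` passes the check and `number: 1` does not; with two remote standbys, `number: 1` does not pass either.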

Example configurations:

Configuration #1 (valid):

instances: 3
postgresql:
  synchronous:
    method: any
    number: 2
    standbyNamesPre:
      - angus

In this configuration, when the primary fails, R = 2 (the local replicas), W = 2, and N = 3 (2 local replicas + 1 remote), allowing failover. In case of an additional replica failing (R = 1) failover is not allowed.

R	W	N	Failover
2	2	3	✅
1	2	3	❌

Configuration #2 (invalid):

instances: 3
postgresql:
  synchronous:
    method: any
    number: 1
    maxStandbyNamesFromCluster: 1
    standbyNamesPre:
      - angus

In this configuration, R = 1 (the single local replica admitted by maxStandbyNamesFromCluster), W = 1, and N = 2 (1 local replica + 1 remote). Failover is not possible in this setup, so quorum failover cannot be enabled with this configuration.

R	W	N	Failover
1	1	2	❌

Configuration #3 (invalid):

instances: 3
postgresql:
  synchronous:
    method: any
    number: 1
    maxStandbyNamesFromCluster: 0
    standbyNamesPre:
      - angus
      - malcolm

In this configuration, R = 0 (the local replicas), W = 1, and N = 2 (0 local replicas + 2 remote). Failover is not possible in this setup, so quorum failover cannot be enabled with this configuration.

R	W	N	Failover
0	1	2	❌

Scenario 5: Three-node cluster, preferred data durability, network partition

Consider a cluster with instances: 3, synchronous.number=1, and dataDurability=preferred that experiences a network partition.

If the operator can communicate with both the primary and the API server, the primary continues to operate, removing unreachable standbys from the synchronous_standby_names set.
If the primary cannot reach the operator or API server, a quorum check is performed. The FailoverQuorum status cannot have changed, as the primary cannot have received new configuration. If the operator can reach both replicas, failover is allowed (R=2). If only one replica is reachable (R=1), failover is not allowed.

R	W	N	Failover
2	1	2	✅
1	1	2	❌

¹ Dynamo: Amazon’s highly available key-value store
