Handling replication apply errors v6.4.0

When a replication apply failure occurs, PGD retries the failed transaction indefinitely by default. Persistent failures can overload the system with repeated apply attempts and fill logs with recurring error messages.

Use the apply_error_policy node group option to control what PGD does when a transaction fails to apply. Depending on the policy you choose, PGD can stop retrying after a fixed number of attempts, record the failed changes for review, and then either disable the subscription or skip the failed transaction and continue replicating.

Choosing an error policy

Four error policies are available, configured at the node group level:

  • keep_retrying retries the failed transaction indefinitely. Nothing is recorded, and the subscription is never disabled. Use for transient failures that you expect to self-heal, such as temporary lock contention. keep_retrying is the default.

  • disable_after_retries retries up to apply_error_max_retries times, then disables the subscription and records the reason. Use when a failure might still be transient but you can't allow it to retry indefinitely.

  • record_all_and_disable records the complete failed transaction, then disables the subscription. Use when you need a complete record of what failed before halting replication.

  • record_all_and_skip records the complete failed transaction, skips it, and continues replicating. The subscription stays enabled, but failed changes accumulate silently in bdr.pending_failed_changes. Use for high-availability deployments where continuity matters more than guaranteeing every transaction is applied.

The record_all_and_disable and record_all_and_skip policies capture the full transaction by replaying it in capture-only mode before taking action, giving you a complete record to inspect and resolve later.

Configuring the error policy

Set the policy on the node group using bdr.alter_node_group_option. To find your node group name:

SELECT node_group_name FROM bdr.local_node_summary;

Apply your chosen policy:

SELECT bdr.alter_node_group_option(
    'top_group',
    'apply_error_policy',
    'record_all_and_disable'
);
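To confirm which policy each subscription is using, one option is to read it back from bdr.subscription_error_status. This is a minimal sketch that assumes the effective_error_policy column of that view (used later on this page) reflects the group option once the change has propagated to the node.

```sql
-- Read back the error policy in effect for each subscription on this node.
-- Assumption: effective_error_policy mirrors the node group option after it propagates.
SELECT sub_name, sub_enabled, effective_error_policy
FROM bdr.subscription_error_status
ORDER BY sub_name;
```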

Configure the supporting parameters as needed. For example:

-- Retry attempts before the policy takes action (default: 3)
SELECT bdr.alter_node_group_option(
    'top_group',
    'apply_error_max_retries',
    '5'
);

-- Maximum transactions record_all_and_skip can auto-skip before disabling.
-- Set to -1 for unlimited. Default is NULL (unlimited).
SELECT bdr.alter_node_group_option(
    'top_group',
    'apply_error_max_skips',
    '100'
);

-- Maximum bytes to capture per failed transaction (default: 10485760, which is 10 MB)
SELECT bdr.alter_node_group_option(
    'top_group',
    'apply_error_max_record_size',
    '52428800'
);

For the full list of apply error group options, see bdr.alter_node_group_option.
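As a combined sketch, a deployment that favors continuity might pair record_all_and_skip with a cap on automatic skips. The group name top_group and the cap of 100 are illustrative values carried over from the examples above, not recommendations.

```sql
-- Record each failed transaction, skip it, and keep replicating.
SELECT bdr.alter_node_group_option(
    'top_group',
    'apply_error_policy',
    'record_all_and_skip'
);

-- Cap automatic skips so a sustained run of failures disables the
-- subscription instead of being skipped without limit (illustrative value).
SELECT bdr.alter_node_group_option(
    'top_group',
    'apply_error_max_skips',
    '100'
);
```

With a cap in place, occasional failures are skipped and recorded, while a sustained run of failures eventually stops replication for review.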

Responding to a disabled subscription

When a subscription is disabled by the disable_after_retries or record_all_and_disable policy, replication stops. Follow these steps to investigate and resolve the failure.

  1. Check which subscriptions are affected and why:

    SELECT sub_name, disable_reason, effective_error_policy, pending_changes
    FROM bdr.subscription_error_status
    WHERE NOT sub_enabled;

    The disable_reason column records why PGD disabled the subscription, and pending_changes shows how many failed changes are waiting for review. For the full column reference, see bdr.subscription_error_status.

  2. Review the failed changes:

    SELECT id, failed_txn_id, subscription_name, remote_xid, remote_commit_lsn,
           operation, qualified_table, sql_statement, error_message
    FROM bdr.pending_failed_changes_sql
    WHERE change_status = 'pending'
    ORDER BY id;

  3. Optionally, analyze whether it's safe to skip. bdr.format_skip_analysis classifies the skip as SAFE, UNSAFE, or UNKNOWN and explains the reasoning:

    SELECT bdr.format_skip_analysis(id)
    FROM bdr.pending_failed_changes
    WHERE change_status = 'pending'
    LIMIT 1;

  4. Preview the resolution using bdr.resolve_failed_transaction, which defaults to dry_run := true:

    SELECT bdr.resolve_failed_transaction(
        'bdr_mydb_mygroup_node1_node2',  -- subscription_name
        '12345'::xid,                    -- remote_xid
        '0/1234567'::pg_lsn,             -- remote_commit_lsn
        'skipped',                       -- resolution: 'skipped', 'applied', or 'discarded'
        true                             -- re_enable
    );

    Alternatively, pass failed_txn_id (the UUID from bdr.pending_failed_changes) instead of the three-part identifier.

  5. Execute the resolution by passing dry_run := false:

    SELECT bdr.resolve_failed_transaction(
        'bdr_mydb_mygroup_node1_node2',
        '12345'::xid,
        '0/1234567'::pg_lsn,
        'skipped',
        true,   -- re_enable
        false   -- dry_run
    );

    To resolve without immediately re-enabling, pass re_enable := false and re-enable manually when ready:

    SELECT bdr.alter_subscription_enable('bdr_mydb_mygroup_node1_node2');

  6. Reset the retry and skip counters so the policy thresholds apply fresh to future failures:

    SELECT bdr.alter_subscription_reset_error_counters('bdr_mydb_mygroup_node1_node2');

  7. Once you've confirmed resolution, delete the resolved rows from bdr.pending_failed_changes to reclaim space. The resolution summary is preserved in bdr.resolved_transactions.

    DELETE FROM bdr.pending_failed_changes
    WHERE change_status IN ('skipped', 'applied');
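The subscription name, remote_xid, and remote_commit_lsn passed to bdr.resolve_failed_transaction in steps 4 and 5 can be read from the view used in step 2. The following is a minimal sketch. It assumes that every change row belonging to one failed transaction carries the same values in these columns. The second query is an optional check after the cleanup in step 7.

```sql
-- One row per pending failed transaction, with the identifiers needed
-- by bdr.resolve_failed_transaction and a count of its captured changes.
-- Assumption: these columns are identical across all rows of a transaction.
SELECT failed_txn_id, subscription_name, remote_xid, remote_commit_lsn,
       count(*) AS pending_change_count
FROM bdr.pending_failed_changes_sql
WHERE change_status = 'pending'
GROUP BY failed_txn_id, subscription_name, remote_xid, remote_commit_lsn
ORDER BY remote_commit_lsn;

-- After cleanup: see how many recorded changes remain in each status.
SELECT change_status, count(*) AS changes
FROM bdr.pending_failed_changes
GROUP BY change_status;
```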

Resolving individual changes

When a failed transaction contains a mix of changes, some safe to resolve and others not, use bdr.skip_failed_change and bdr.apply_failed_change to operate on individual rows rather than the transaction as a whole. Once you've resolved all changes individually, call bdr.resolve_failed_transaction to advance the subscription position and re-enable it.
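Before deciding row by row, it can help to list every captured change for a single failed transaction. The following is a minimal sketch using the bdr.pending_failed_changes_sql view. The UUID is a hypothetical placeholder for a real failed_txn_id. The calls to bdr.skip_failed_change and bdr.apply_failed_change aren't shown; check the function reference for their arguments.

```sql
-- List all captured changes for one failed transaction, in capture order,
-- so each row can be judged individually.
SELECT id, operation, qualified_table, sql_statement, error_message, change_status
FROM bdr.pending_failed_changes_sql
WHERE failed_txn_id = '00000000-0000-0000-0000-000000000000'  -- hypothetical UUID
ORDER BY id;
```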

Handling truncated captures

When a failed transaction exceeds apply_error_max_record_size, PGD records as many changes as it can and sets capture_truncated = true in bdr.pending_failed_changes. Truncated transactions can't be reapplied using bdr.apply_failed_transaction because the record is incomplete.

To resolve a truncated transaction, first use bdr.reconstruct_failed_transaction to review the partial SQL. Then manually apply the complete transaction on the target node, using that output as a reference. Finally, use bdr.skip_failed_transaction to mark the captured changes as skipped and re-enable the subscription.

To prevent truncation on large transactions, increase apply_error_max_record_size before the next failure.
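To find which recorded transactions were truncated and then raise the limit, a sketch along these lines may help. It assumes capture_truncated is a boolean column and that failed_txn_id is available in bdr.pending_failed_changes. The group name top_group and the 100 MB limit are illustrative.

```sql
-- Failed transactions whose capture was cut short by apply_error_max_record_size.
SELECT failed_txn_id, count(*) AS captured_changes
FROM bdr.pending_failed_changes
WHERE capture_truncated
GROUP BY failed_txn_id;

-- Raise the per-transaction capture limit to 100 MB (104857600 bytes).
SELECT bdr.alter_node_group_option(
    'top_group',
    'apply_error_max_record_size',
    '104857600'
);
```

A larger limit means a failed transaction can take more space in bdr.pending_failed_changes, so choose it with your largest expected transactions in mind.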

Monitoring for accumulated failures

With record_all_and_skip, the subscription stays enabled and replication continues, but failed transactions accumulate silently. Check bdr.subscription_error_status regularly to catch them:

SELECT sub_name, total_skips, pending_changes
FROM bdr.subscription_error_status;

When pending_changes is non-zero, review and resolve the accumulated failures by following the steps in Responding to a disabled subscription. Skip the re-enable step, because the subscription is still active.
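For a scheduled check or alert, the monitoring query can be narrowed to subscriptions that need attention. This is a minimal sketch that assumes pending_changes is a numeric count.

```sql
-- Return only subscriptions with recorded failures awaiting review,
-- largest backlog first.
SELECT sub_name, sub_enabled, total_skips, pending_changes
FROM bdr.subscription_error_status
WHERE pending_changes > 0
ORDER BY pending_changes DESC;
```

An empty result means no recorded failures are waiting for review on this node.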

To review past resolutions:

SELECT subscription_name, origin_node, remote_xid, resolution,
       insert_count, update_count, delete_count, error_message, resolved_at
FROM bdr.resolved_transactions_info
ORDER BY resolved_at DESC;
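To spot trends rather than individual entries, past resolutions can be summarized per subscription. This is a minimal sketch using only columns from the query above. It assumes resolved_at is a timestamp.

```sql
-- Count past resolutions by subscription and resolution type,
-- with the time of the most recent one.
SELECT subscription_name, resolution,
       count(*) AS transactions,
       max(resolved_at) AS last_resolved_at
FROM bdr.resolved_transactions_info
GROUP BY subscription_name, resolution
ORDER BY subscription_name, resolution;
```

A subscription with a steadily growing count of skipped transactions may point to an underlying conflict or schema mismatch worth investigating.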