When a replication apply failure occurs, PGD retries the failed transaction indefinitely by default. Persistent failures can overload the system with repeated apply attempts and fill logs with recurring error messages.
Use the apply_error_policy node group option to control what PGD does when a transaction fails to apply. Set it to stop retrying after a set number of attempts, record failed changes for review, and either disable the subscription or skip the failed transaction and continue replicating.
Choosing an error policy
Four error policies are available, configured at the node group level:
keep_retrying retries the failed transaction indefinitely. PGD keeps retrying silently, nothing is recorded, and no subscription is disabled. Use for transient failures expected to self-heal, such as temporary lock contention. keep_retrying is the default.
disable_after_retries retries up to apply_error_max_retries times, then disables the subscription and records the reason. Use when failures can't persist indefinitely but might still be transient.
record_all_and_disable records the complete failed transaction, then disables the subscription. Use when you need a complete record of what failed before halting replication.
record_all_and_skip records the complete failed transaction, skips it, and continues replicating. The subscription stays enabled, but failed changes accumulate silently in bdr.pending_failed_changes. Use for high-availability deployments where continuity matters more than guaranteeing every transaction is applied.
The record_all_and_disable and record_all_and_skip policies capture the full transaction by replaying it in capture-only mode before taking action, giving you a complete record to inspect and resolve later.
Configuring the error policy
Set the policy on the node group using bdr.alter_node_group_option. To find your node group name:
SELECT node_group_name FROM bdr.local_node_summary;
Apply your chosen policy:
SELECT bdr.alter_node_group_option( 'top_group', 'apply_error_policy', 'record_all_and_disable' );
Configure the supporting parameters as needed. For example:
-- Retry attempts before the policy takes action (default: 3)
SELECT bdr.alter_node_group_option( 'top_group', 'apply_error_max_retries', '5' );

-- Maximum transactions record_all_and_skip can auto-skip before disabling.
-- Set to -1 for unlimited. Default is NULL (unlimited).
SELECT bdr.alter_node_group_option( 'top_group', 'apply_error_max_skips', '100' );

-- Maximum bytes to capture per failed transaction (default: 10485760, which is 10 MB)
SELECT bdr.alter_node_group_option( 'top_group', 'apply_error_max_record_size', '52428800' );
For the full list of apply error group options, see bdr.alter_node_group_option.
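As a further illustration, the policy and its supporting parameter can be set together. This is a minimal sketch that reuses only the calls and option names shown above. The group name 'top_group' and the cap of 50 are placeholder values, not recommendations.

```sql
-- Sketch: skip failed transactions but stop after a bounded number of skips.
-- 'top_group' and '50' are illustrative values; substitute your own.
SELECT bdr.alter_node_group_option(
    'top_group',
    'apply_error_policy',
    'record_all_and_skip'
);

-- Cap the number of transactions that can be auto-skipped before
-- the subscription is disabled.
SELECT bdr.alter_node_group_option(
    'top_group',
    'apply_error_max_skips',
    '50'
);
```

Pairing record_all_and_skip with a finite apply_error_max_skips keeps replication running through isolated failures while still halting if failures become frequent.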
Responding to a disabled subscription
When a subscription is disabled by the disable_after_retries or record_all_and_disable policy, replication stops. Follow these steps to investigate and resolve the failure.
Check which subscriptions are affected and why:
SELECT sub_name, disable_reason, effective_error_policy, pending_changes FROM bdr.subscription_error_status WHERE NOT sub_enabled;
The disable_reason column records why PGD disabled the subscription, and pending_changes shows how many failed changes are waiting for review. For the full column reference, see bdr.subscription_error_status.
Review the failed changes:
SELECT id, failed_txn_id, subscription_name, remote_xid, remote_commit_lsn,
       operation, qualified_table, sql_statement, error_message
FROM bdr.pending_failed_changes_sql
WHERE change_status = 'pending'
ORDER BY id;
Optionally, analyze whether it's safe to skip. bdr.format_skip_analysis classifies the skip as SAFE, UNSAFE, or UNKNOWN and explains the reasoning:
SELECT bdr.format_skip_analysis(id) FROM bdr.pending_failed_changes WHERE change_status = 'pending' LIMIT 1;
Preview the resolution using bdr.resolve_failed_transaction, which defaults to dry_run := true:
SELECT bdr.resolve_failed_transaction(
    'bdr_mydb_mygroup_node1_node2',  -- subscription_name
    '12345'::xid,                    -- remote_xid
    '0/1234567'::pg_lsn,             -- remote_commit_lsn
    'skipped',                       -- resolution: 'skipped', 'applied', or 'discarded'
    true                             -- re_enable
);
Alternatively, pass failed_txn_id (the UUID from bdr.pending_failed_changes) instead of the three-part identifier.
Execute the resolution by passing dry_run := false:
SELECT bdr.resolve_failed_transaction(
    'bdr_mydb_mygroup_node1_node2',
    '12345'::xid,
    '0/1234567'::pg_lsn,
    'skipped',
    true,   -- re_enable
    false   -- dry_run
);
To resolve without immediately re-enabling, pass re_enable := false and re-enable manually when ready:
SELECT bdr.alter_subscription_enable('bdr_mydb_mygroup_node1_node2');
Reset the retry and skip counters so the policy thresholds apply fresh to future failures:
SELECT bdr.alter_subscription_reset_error_counters('bdr_mydb_mygroup_node1_node2');
Once you've confirmed resolution, delete the resolved rows from bdr.pending_failed_changes to reclaim space. The resolution summary is preserved in bdr.resolved_transactions.
DELETE FROM bdr.pending_failed_changes WHERE change_status IN ('skipped', 'applied');
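Before running the delete, it can help to see how many rows fall under each status, so that nothing still marked pending is removed by mistake. This is a minimal sketch that relies only on the change_status column used in the earlier queries:

```sql
-- Sketch: summarize captured changes by status before cleaning up.
SELECT change_status, count(*) AS changes
FROM bdr.pending_failed_changes
GROUP BY change_status
ORDER BY change_status;
```

Rows reported as 'pending' are untouched by the delete, because it filters on 'skipped' and 'applied' only.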
Resolving individual changes
When a failed transaction contains a mix of changes, some safe to resolve and others not, use bdr.skip_failed_change and bdr.apply_failed_change to operate on individual rows rather than the transaction as a whole. Once you've resolved all changes individually, call bdr.resolve_failed_transaction to advance the subscription position and re-enable it.
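To decide which changes to handle individually, one option is to run the skip analysis across every pending change in a single failed transaction. This is a sketch only: it assumes bdr.format_skip_analysis can be called once per row, as in the earlier single-row example, and the UUID is a hypothetical placeholder for a real failed_txn_id. The signatures of bdr.skip_failed_change and bdr.apply_failed_change aren't shown here, so check their reference entries before calling them.

```sql
-- Sketch: per-change skip analysis for one failed transaction.
-- The UUID below is a hypothetical placeholder; use a failed_txn_id
-- taken from bdr.pending_failed_changes.
SELECT id,
       bdr.format_skip_analysis(id) AS analysis
FROM bdr.pending_failed_changes
WHERE change_status = 'pending'
  AND failed_txn_id = '00000000-0000-0000-0000-000000000000'
ORDER BY id;
```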
Handling truncated captures
When a failed transaction exceeds apply_error_max_record_size, PGD records as many changes as it can and sets capture_truncated = true in bdr.pending_failed_changes. Truncated transactions can't be reapplied using bdr.apply_failed_transaction because the record is incomplete.
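A query along the following lines can list the transactions whose capture was truncated. This is a sketch that assumes capture_truncated is a boolean column and uses only column names mentioned in this section:

```sql
-- Sketch: find failed transactions with an incomplete (truncated) capture.
SELECT DISTINCT failed_txn_id
FROM bdr.pending_failed_changes
WHERE capture_truncated = true
  AND change_status = 'pending';
```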
To resolve a truncated transaction, use bdr.reconstruct_failed_transaction to review the partial SQL, manually apply the complete transaction on the target node using that output as a reference, then use bdr.skip_failed_transaction to mark the captured changes as skipped and re-enable the subscription.
To prevent truncation on large transactions, increase apply_error_max_record_size before the next failure.
Monitoring for accumulated failures
With record_all_and_skip, the subscription stays enabled and replication continues, but failed transactions accumulate silently. Check bdr.subscription_error_status regularly to catch them:
SELECT sub_name, total_skips, pending_changes FROM bdr.subscription_error_status;
When pending_changes is non-zero, review and resolve the accumulated failures by following the steps in Responding to a disabled subscription. Skip the re-enable step, since the subscription is still active.
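For a scheduled check, the monitoring query can be narrowed to subscriptions that need attention. This is a minimal sketch that assumes pending_changes is a numeric count, as the non-zero test above implies:

```sql
-- Sketch: list only subscriptions with unresolved failed changes,
-- with the largest backlog first.
SELECT sub_name, total_skips, pending_changes
FROM bdr.subscription_error_status
WHERE pending_changes > 0
ORDER BY pending_changes DESC;
```

An empty result means no subscription currently has failed changes waiting for review.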
To review past resolutions:
SELECT subscription_name, origin_node, remote_xid, resolution,
       insert_count, update_count, delete_count, error_message, resolved_at
FROM bdr.resolved_transactions_info
ORDER BY resolved_at DESC;
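For a quick overview rather than a row-by-row listing, past resolutions can be grouped by outcome. This is a sketch that assumes resolved_at is a timestamp column. The seven-day window is an arbitrary illustrative choice.

```sql
-- Sketch: count recent resolutions by type ('skipped', 'applied', 'discarded').
-- Assumes resolved_at is a timestamp; the 7-day window is illustrative.
SELECT resolution, count(*) AS transactions
FROM bdr.resolved_transactions_info
WHERE resolved_at > now() - interval '7 days'
GROUP BY resolution
ORDER BY transactions DESC;
```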