Application use v4

Learn about the application from a user perspective.

Application behavior

BDR supports replicating changes made on one node to other nodes.

BDRs, by default, replicate all changes from INSERT, UPDATE, DELETE and TRUNCATE operations from the source node to other nodes. Only the final changes are sent, after all triggers and rules are processed. For example, INSERT ... ON CONFLICT UPDATE sends either an insert or an update depending on what occurred on the origin. If an update or delete affects zero rows, then no changes are sent.

INSERT can be replicated without any preconditions.

For updates and deletes to replicate on other nodes, we must be able to identify the unique rows affected. BDR requires that a table have either a PRIMARY KEY defined, a UNIQUE constraint, or an explicit REPLICA IDENTITY defined on specific columns. If one of those isn't defined, a warning is generated, and later updates or deletes are explicitly blocked. If REPLICA IDENTITY FULL is defined for a table, then a unique index isn't required. In that case, updates and deletes are allowed and use the first non-unique index that is live, valid, not deferred, and doesn't have expressions or WHERE clauses. Otherwise, a sequential scan is used.

You can use TRUNCATE even without a defined replication identity. Replication of TRUNCATE commands is supported, but take care when truncating groups of tables connected by foreign keys. When replicating a truncate action, the subscriber truncates the same group of tables that was truncated on the origin, either explicitly specified or implicitly collected by CASCADE, except in cases where replication sets are defined. See Replication sets for further details and examples. This works correctly if all affected tables are part of the same subscription. But if some tables to be truncated on the subscriber have foreign-key links to tables that aren't part of the same (or any) replication set, then applying the truncate action on the subscriber fails.

Row-level locks taken implicitly by INSERT, UPDATE, and DELETE commands are replicated as the changes are made. Table-level locks taken implicitly by INSERT, UPDATE, DELETE, and TRUNCATE commands are also replicated. Explicit row-level locking (SELECT ... FOR UPDATE/FOR SHARE) by user sessions isn't replicated, nor are advisory locks. Information stored by transactions running in SERIALIZABLE mode isn't replicated to other nodes. The transaction isolation level of SERIALIAZABLE is supported, but transactions aren't serialized across nodes in the presence of concurrent transactions on multiple nodes.

If DML is executed on multiple nodes concurrently, then potential conflicts might occur if executing with asynchronous replication. These must be either handled or avoided. Various avoidance mechanisms are possible, discussed in Conflicts.

Sequences need special handling, described in Sequences.

Binary data in BYTEA columns is replicated normally, allowing "blobs" of data up to 1 GB in size. Use of the PostgreSQL "large object" facility isn't supported in BDR.

Rules execute only on the origin node so aren't executed during apply, even if they're enabled for replicas.

Replication is possible only from base tables to base tables. That is, the tables on the source and target on the subscription side must be tables, not views, materialized views, or foreign tables. Attempts to replicate tables other than base tables result in an error. DML changes that are made through updatable views are resolved to base tables on the origin and then applied to the same base table name on the target.

BDR supports partitioned tables transparently, meaning that a partitioned table can be added to a replication set and changes that involve any of the partitions are replicated downstream.

By default, triggers execute only on the origin node. For example, an INSERT trigger executes on the origin node and is ignored when you apply the change on the target node. You can specify for triggers to execute on both the origin node at execution time and on the target when it's replicated ("apply time") by using ALTER TABLE ... ENABLE ALWAYS TRIGGER, or use the REPLICA option to execute only at apply time: ALTER TABLE ... ENABLE REPLICA TRIGGER.

Some types of trigger aren't executed on apply, even if they exist on a table and are currently enabled. Trigger types not executed are:

  • Statement-level triggers (FOR EACH STATEMENT)
  • Per-column UPDATE triggers (UPDATE OF column_name [, ...])

BDR replication apply uses the system-level default search_path. Replica triggers, stream triggers, and index expression functions can assume other search_path settings that then fail when they execute on apply. To prevent this from occurring, resolve object references clearly using either only the default search_path, always use fully qualified references to objects, e.g., schema.objectname, or set the search path for a function using ALTER FUNCTION ... SET search_path = ... for the functions affected.

BDR assumes that there are no issues related to text or other collatable datatypes, i.e., all collations in use are available on all nodes, and the default collation is the same on all nodes. Replication of changes uses equality searches to locate Replica Identity values, so this does't have any effect except where unique indexes are explicitly defined with nonmatching collation qualifiers. Row filters might be affected by differences in collations if collatable expressions were used.

BDR handling of very long "toasted" data in PostgreSQL is transparent to the user. The TOAST "chunkid" values likely differ between the same row on different nodes, but that doesn't cause any problems.

BDR can't work correctly if Replica Identity columns are marked as external.

PostgreSQL allows CHECK() constraints that contain volatile functions. Since BDR re-executes CHECK() constraints on apply, any subsequent re-execution that doesn't return the same result as previously causes data divergence.

BDR doesn't restrict the use of foreign keys. Cascading FKs are allowed.

Nonreplicated statements

None of the following user commands are replicated by BDR, so their effects occur on the local/origin node only:

  • Cursor operations (DECLARE, CLOSE, FETCH)
  • Execution commands (DO, CALL, PREPARE, EXECUTE, EXPLAIN)
  • Session management (DEALLOCATE, DISCARD, LOAD)
  • Parameter commands (SET, SHOW)
  • Constraint manipulation (SET CONSTRAINTS)
  • Locking commands (LOCK)
  • Table maintenance commands (VACUUM, ANALYZE, CLUSTER, REINDEX)
  • Async operations (NOTIFY, LISTEN, UNLISTEN)

Since the NOTIFY SQL command and the pg_notify() functions aren't replicated, notifications aren't reliable in case of failover. This means that notifications can easily be lost at failover if a transaction is committed just when the server crashes. Applications running LISTEN might miss notifications in case of failover. This is true in standard PostgreSQL replication, and BDR doesn't yet improve on this. CAMO and Eager Replication options don't allow the NOTIFY SQL command or the pg_notify() function.

DML and DDL replication

BDR doesn't replicate the DML statement. It replicates the changes caused by the DML statement. For example, an UPDATE that changed two rows replicates two changes, whereas a DELETE that didn't remove any rows doesn't replicate anything. This means that the results of executing volatile statements are replicated, ensuring there's no divergence between nodes as might occur with statement-based replication.

DDL replication works differently to DML. For DDL, BDR replicates the statement, which then executes on all nodes. So a DROP TABLE IF EXISTS might not replicate anything on the local node, but the statement is still sent to other nodes for execution if DDL replication is enabled. Full details are covered in DDL replication.

BDR works to ensure that intermixed DML and DDL statements work correctly, even in the same transaction.

Replicating between different release levels

BDR is designed to replicate between nodes that have different major versions of PostgreSQL. This feature is designed to allow major version upgrades without downtime.

BDR is also designed to replicate between nodes that have different versions of BDR software. This feature is designed to allow version upgrades and maintenance without downtime.

However, while it's possible to join a node with a major version in a cluster, you can't add a node with a minor version if the cluster uses a newer protocol version. This returns an error.

Both of these features might be affected by specific restrictions. See Release notes for any known incompatibilities.

Replicating between nodes with differences

By default, DDL is automatically sent to all nodes. You can control this manually, as described in DDL Replication, and you could use it to create differences between database schemas across nodes. BDR is designed to allow replication to continue even with minor differences between nodes. These features are designed to allow application schema migration without downtime or to allow logical standby nodes for reporting or testing.

Currently, replication requires the same table name on all nodes. A future feature might allow a mapping between different table names.

It is possible to replicate between tables with dissimilar partitioning definitions, such as a source that is a normal table replicating to a partitioned table, including support for updates that change partitions on the target. It can be faster if the partitioning definition is the same on the source and target since dynamic partition routing doesn't need to execute at apply time. For details, see Replication sets.

By default, all columns are replicated. BDR replicates data columns based on the column name. If a column has the same name but a different datatype, we attempt to cast from the source type to the target type, if casts were defined that allow that.

BDR supports replicating between tables that have a different number of columns.

If the target has missing columns from the source, then BDR raises a target_column_missing conflict, for which the default conflict resolver is ignore_if_null. This throws an error if a non-NULL value arrives. Alternatively, you can also configure a node with a conflict resolver of ignore. This setting doesn't throw an error but silently ignores any additional columns.

If the target has additional columns not seen in the source record, then BDR raises a source_column_missing conflict, for which the default conflict resolver is use_default_value. Replication proceeds if the additional columns have a default, either NULL (if nullable) or a default expression, but throws an error and halts replication if not.

Transform triggers can also be used on tables to provide default values or alter the incoming data in various ways before apply.

If the source and the target have different constraints, then replication is attempted, but it might fail if the rows from source can't be applied to the target. Row filters can help here.

Replicating data from one schema to a more relaxed schema won't cause failures. Replicating data from a schema to a more restrictive schema can be a source of potential failures. The right way to solve this is to place a constraint on the more relaxed side, so bad data can't be entered. That way, no bad data ever arrives by replication, so it never fails the transform into the more restrictive schema. For example, if one schema has a column of type TEXT and another schema defines the same column as XML, add a CHECK constraint onto the TEXT column to enforce that the text is XML.

You can define a table with different indexes on each node. By default, the index definitions are replicated. See DDL replication to specify how to create an index only on a subset of nodes or just locally.

Storage parameters, such as fillfactor and toast_tuple_target, can differ between nodes for a table without problems. An exception to that is the value of a table's storage parameter user_catalog_table must be identical on all nodes.

A table being replicated must be owned by the same user/role on each node. See Security and roles for further discussion.

Roles can have different passwords for connection on each node, although by default changes to roles are replicated to each node. See DDL replication to specify how to alter a role password only on a subset of nodes or locally.

Comparison between nodes with differences

LiveCompare is a tool for data comparison on a database, against BDR and non-BDR nodes. It needs a minimum of two connections to compare against and reach a final result.

Since LiveCompare 1.3, you can configure with all_bdr_nodes set. This saves you from clarifying all the relevant DSNs for each separate node in the cluster. A BDR cluster has N amount of nodes with connection information, but it's only the initial and output connection that LiveCompare 1.3+ needs to complete its job. Setting logical_replication_mode states how all the nodes are communicating.

All the configuration is done in a .ini file, named bdrLC.ini, for example. Find templates for this configuration file in /etc/2ndq-livecompare/.

While LiveCompare executes, you see N+1 progress bars, N being the number of processes. Once all the tables are sourced, a time displays, as the transactions per second (tps) was measured. This continues to count the time, giving you an estimate and then a total execution time at the end.

This tool offers a lot of customization and filters, such as tables, schemas, and replication_sets. LiveCompare can use stop-start without losing context information, so it can run at convenient times. After the comparison, a summary and a DML script are generated so you can review it. Apply the DML to fix the any differences found.

General rules for applications

BDR uses replica identity values to identify the rows to change. Applications can cause difficulties if they insert, delete, and then later reuse the same unique identifiers. This is known as the ABA problem. BDR can't know whether the rows are the current row, the last row, or much older rows.

Similarly, since BDR uses table names to identify the table against which changes are replayed, a similar ABA problem exists with applications that create, drop, and then later reuse the same object names.

These issues give rise to some simple rules for applications to follow:

  • Use unique identifiers for rows (INSERT).
  • Avoid modifying unique identifiers (UPDATE).
  • Avoid reusing deleted unique identifiers.
  • Avoid reusing dropped object names.

In the general case, breaking those rules can lead to data anomalies and divergence. Applications can break those rules as long as certain conditions are met, but use caution: while anomalies are unlikely, they aren't impossible. For example, a row value can be reused as long as the DELETE was replayed on all nodes, including down nodes. This might normally occur in less than a second but can take days if a severe issue occurred on one node that prevented it from restarting correctly.

Timing considerations and synchronous replication

Being asynchronous by default, peer nodes might lag behind, making it possible for a client connected to multiple BDR nodes or switching between them to read stale data.

A queue wait function is provided for clients or proxies to prevent such stale reads.

The synchronous replication features of Postgres are available to BDR as well. In addition, BDR provides multiple variants for more synchronous replication. See Durability and performance options for an overview and comparison of all variants available and its different modes.

Application testing

You can test BDR applications using the following programs, in addition to other techniques.

TPAexec

TPAexec is the system used by EDB to deploy reference TPA architectures, including those based on Postgres-BDR.

TPAexec includes test suites for each reference architecture. It also simplifies creating and managing a local collection of tests to run against a TPA cluster, using a syntax like the following:

tpaexec test mycluster mytest

We strongly recommend that developers write their own multi-node suite of TPAexec tests that verify the main expected properties of the application.

pgbench with CAMO/Failover options

In EDB Postgres Extended, the pgbench was extended to allow users to run failover tests while using CAMO or regular BDR deployments. The following options were added:

-m, --mode=regular|camo|failover
mode in which pgbench should run (default: regular)

--retry
retry transactions on failover

In addition to these options, the connection information about the peer node for failover must be specified in DSN form.

  • Use -m camo or -m failover to specify the mode for pgbench. You can use The -m failover specification to test failover in regular BDR deployments.

  • Use --retry to specify whether to retry transactions when failover happens with -m failover mode. This option is enabled by default for -m camo mode.

Here's an example in a CAMO environment:

    pgbench -m camo -p $node1_port -h $node1_host bdrdemo \
        "host=$node2_host user=postgres port=$node2_port dbname=bdrdemo"

This command runs in camo mode. It connects to node1 and runs the tests. If the connection to node1 is lost, then pgbench connects to node2. It queries node2 to get the status of in-flight transactions. Aborted and in-flight transactions are retried in camo mode.

In failover mode, if --retry is specified, then in-flight transactions are retried. In this scenario there's no way to find the status of in-flight transactions.

isolationtester with multi-node access

isolationtester was extended to allow users to run tests on multiple sessions and on multiple nodes. This is used for internal BDR testing, although it's also available for use with user application testing.

$ isolationtester \
     --outputdir=./iso_output \
     --create-role=logical \
     --dbname=postgres \
     --server 'd1=dbname=node1' \
     --server 'd2=dbname=node2' \
     --server 'd3=dbname=node3'

Isolation tests are a set of tests for examining concurrent behaviors in PostgreSQL. These tests require running multiple interacting transactions, which requires managing multiple concurrent connections and therefore can't be tested using the normal pg_regress program. The name "isolation" comes from the fact that the original motivation was to test the serializable isolation level. Tests for other sorts of concurrent behaviors were added as well.

It's built using PGXS as an external module. On installation, it creates the isolationtester binary file, which is run by pg_isolation_regress to perform concurrent regression tests and observe results.

pg_isolation_regress is a tool similar to pg_regress, but instead of using psql to execute a test, it uses isolationtester. It accepts all the same command-line arguments as pg_regress. It was modified to accept multiple hosts as parameters. It then passes the host conninfo along with server names to the isolationtester binary. Isolation tester compares these server names with the names specified in each session in the spec files and runs given tests on respective servers.

To define tests with overlapping transactions, we use test specification files with a custom syntax. To add a new test, place a spec file in the specs/ subdirectory, add the expected output in the expected/ subdirectory, and add the test's name to the makefile.

Isolationtester is a program that uses libpq to open multiple connections and executes a test specified by a spec file. A libpq connection string specifies the server and database to connect to. Defaults derived from environment variables are used otherwise.

Specification consists of five parts, tested in this order:

server "<name>"

This defines the name of the servers for the sessions to run on. There can be zero or more server "<name>" specifications. The conninfo corresponding to the names is provided by the command to run isolationtester. This is described in quickstart_isolationtest.md. This part is optional.

setup { <SQL> }

The given SQL block is executed once, in one session only, before running the test. Create any test tables or other required objects here. This part is optional. Multiple setup blocks are allowed if needed. Each is run separately, in the given order. The reason for allowing multiple setup blocks is that each block is run as a single PQexec submission, and some statements such as VACUUM can't be combined with others in such a block.

teardown { <SQL> }

The teardown SQL block is executed once after the test is finished. Use this to clean up in preparation for the next permutation, such as dropping any test tables created by setup. This part is optional.

session "<name>"

There are normally several "session" parts in a spec file. Each session is executed in its own connection. A session part consists of three parts: setup, teardown, and one or more "steps." The per-session setup and teardown parts have the same syntax as the per-test setup and teardown, but they are executed in each session. The setup part typically contains a BEGIN command to begin a transaction.

A session part also consists of connect_to specification. This points to a server name specified in the beginning that indicates the server on which this session runs.

connect_to "<name>"

Each step has the syntax:

step "<name>" { <SQL> }

where <name> is a name identifying this step, and SQL is a SQL statement (or statements, separated by semicolons) that's executed in the step. Step names must be unique across the whole spec file.

permutation "<step name>"

A permutation line specifies a list of steps that are run in that order. Any number of permutation lines can appear. If no permutation lines are given, the test program automatically generates all possible orderings of the steps from each session (running the steps of any one session in order). The list of steps in a manually specified "permutation" line doesn't actually have to be a permutation of the available steps. It can, for instance, repeat some steps more than once or leave others out.

Lines beginning with a # are comments.

For each permutation of the session steps (whether these are manually specified in the spec file or automatically generated), the isolation tester runs:

  1. The main setup part
  2. Per-session setup parts
  3. The selected session steps
  4. Per-session teardown
  5. The main teardown script

Each selected step is sent to the connection associated with its session.

To run isolation tests in a BDR environment that ran all prerequisite make commands:

  1. Run make isolationcheck-install to install the isolationtester submodule.

  2. You can run isolation regression tests using either of the following commands from the bdr-private repo:

    make isolationcheck-installcheck make isolationcheck-makecheck

To run isolationcheck-installcheck, you need to have two or more postgresql servers running. Pass the conninfo of each server to pg_isolation_regress in the BDR makefile. Ex: pg_isolation_regress --server 'd1=host=myhost dbname=mydb port=5434' --server 'd2=host=myhost1 dbname=mydb port=5432'

Next, add a .spec file containing tests in the specs/isolation directory of the bdr-private/ repo. Add a .out file in expected/isolation directory of the nbdr-private/ repo.

Then run make isolationcheck-installcheck

Isolationcheck-makecheck currently supports running isolation tests on a single instance by setting up BDR between multiple databases.

You need to pass appropriate database names and the conninfo of bdr instances to pg_isolation_regress in the BDR makefile as follows: pg_isolation_regress --dbname=db1,db2 --server 'd1=dbname=db1' --server 'd2=dbname=db2'

Then run make isolationcheck-makecheck

Each step can contain commands that block until further action has been taken (most likely, some other session runs a step that unblocks it or causes a deadlock). A test that uses this ability must manually specify valid permutations, that is, those that don'tt expect a blocked session to execute a command. If a test doesn't follow that rule, isolationtester cancels it after 300 seconds. If the cancel doesn't work, isolationtester exits uncleanly after 375 seconds of wait time. Avoid testing invalid permutations because they can make the isolation tests take a very long time to run, and they serve no useful testing purpose.

isolationtester recognizes that a command has blocked by checking whether it is shown as waiting in the pg_locks view. Therefore, only blocks on heavyweight locks are detected.

Performance testing and tuning

BDR allows you to issue write transactions onto multiple master nodes. Bringing those writes back together onto each node has a cost in performance.

First, replaying changes from another node has a CPU cost, an I/O cost, and it generates WAL records. The resource use is usually less than in the original transaction since CPU overheads are lower as a result of not needing to reexecute SQL. In the case of UPDATE and DELETE transactions, there might be I/O costs on replay if data isn't cached.

Second, replaying changes holds table-level and row-level locks that can produce contention against local workloads. The conflict-free replicated data types (CRDT) and column-level conflict detection (CLCD) features ensure you get the correct answers even for concurrent updates, but they don't remove the normal locking overheads. If you get locking contention, try to avoid conflicting updates or keep transactions as short as possible. A heavily updated row in a larger transaction causes a bottleneck on performance for that transaction. Complex applications require some thought to maintain scalability.

If you think you're having performance problems, develop performance tests using the benchmarking tools. pgbench allows you to write custom test scripts specific to your use case so you can understand the overheads of your SQL and measure the impact of concurrent execution.

If BDR is running slow, then we suggest the following:

  1. Write a custom test script for pgbench, as close as you can make it to the production system's problem case.
  2. Run the script on one node to give you a baseline figure.
  3. Run the script on as many nodes as occurs in production, using the same number of sessions in total as you did on one node. This shows you the effect of moving to multiple nodes.
  4. Increase the number of sessions for these two tests so you can plot the effect of increased contention on your application.
  5. Make sure your tests are long enough to account for replication delays.
  6. Ensure that replication delay isn't growing during your tests.

Use all of the normal Postgres tuning features to improve the speed of critical parts of your application.

Assessing suitability

BDR is compatible with PostgreSQL, but not all PostgreSQL applications are suitable for use on distributed databases. Most applications are already or can easily be modified to become BDR compliant. You can undertake an assessment activity in which you can point your application to a BDR-enabled setup. BDR provides a few knobs that can be set during the assessment period. These aid in the process of deciding suitability of your application in a BDR-enabled environment.

Assessing updates of primary key/replica identity

BDR can't currently perform conflict resolution where the PRIMARY KEY is changed by an UPDATE operation. You can update the primary key, but you must ensure that no conflict with existing values is possible.

BDR provides the following configuration parameter to assess how frequently the primary key/replica identity of any table is being subjected to update operations.

Use these configuration parameters only for assessment. You can use them on a single node BDR instance, but don't use them on a production BDR cluster with two or more nodes replicating to each other. In fact, a node might fail to start or a new node fail to join the cluster if any of the assessment parameters are set to anything other than IGNORE.

bdr.assess_update_replica_identity = IGNORE (default) | LOG | WARNING | ERROR

By enabling this parameter during the assessment period, you can log updates to the key/replica identity values of a row. You can also potentially block such updates, if desired. For example:

CREATE TABLE public.test(g int primary key, h int);
INSERT INTO test VALUES (1, 1);

SET bdr.assess_update_replica_identity TO 'error';
UPDATE test SET g = 4 WHERE g = 1;
ERROR:  bdr_assess: update of key/replica identity of table public.test

Apply worker processes always ignore any settings for this parameter.

Assessing use of LOCK on tables or in SELECT queries

Because BDR writer processes operate much like normal user sessions, they're subject to the usual rules around row and table locking. This can sometimes lead to BDR writer processes waiting on locks held by user transactions or even by each other.

BDR provides the following configuration parameter to assess if the application is taking explicit locks:

bdr.assess_lock_statement = IGNORE (default) | LOG | WARNING | ERROR

Two types of locks that you can track are:

  • Explicit table-level locking (LOCK TABLE ...) by user sessions
  • Explicit row-level locking (SELECT ... FOR UPDATE/FOR SHARE) by user sessions

By enabling this parameter during the assessment period, you can track (or block) such explicit locking activity. For example:

CREATE TABLE public.test(g int primary key, h int);
INSERT INTO test VALUES (1, 1);

SET bdr.assess_lock_statement TO 'error';
SELECT * FROM test FOR UPDATE;
ERROR:  bdr_assess: "SELECT FOR UPDATE" invoked on a BDR node

SELECT * FROM test FOR SHARE;
ERROR:  bdr_assess: "SELECT FOR SHARE" invoked on a BDR node

SET bdr.assess_lock_statement TO 'warning';
LOCK TABLE test IN ACCESS SHARE MODE;
WARNING:  bdr_assess: "LOCK STATEMENT" invoked on a BDR node