Column-level conflict detection v4
By default, conflicts are resolved at row level. That is, when changes from two nodes conflict, we pick either the local or remote tuple and discard the other one. For example, we might compare commit timestamps for the two conflicting changes and keep the newer one. This ensures that all nodes converge to the same result and establishes commit-order-like semantics on the whole cluster.
However, in some cases it might be appropriate to resolve conflicts at the column level rather than the row level.
Consider a simple example, where we have a table t with two integer
columns a and b and a single row
(1,1). Assume that on one node
On another node we concurrently (before receiving the preceding
This results in an
UPDATE-UPDATE conflict. With the
conflict resolution, we compare the commit timestamps and keep the new
row version. Assuming the second node committed last, we end up with
(1,100), effectively discarding the change to column a.
For many use cases, this is the desired and expected behavior, but for some this might be an issue. Consider, for example, a multi-node cluster where each part of the application is connected to a different node, updating a dedicated subset of columns in a shared table. In that case, the different components might step on each other's toes, overwriting their changes.
For such use cases, it might be more appropriate to resolve conflicts on
a given table at the column level. To achieve that, BDR tracks
the timestamp of the last change for each column separately and uses that
to pick the most recent value (essentially
Applied to the previous example, we'll end up with
(100,100) on both
nodes, despite neither of the nodes ever seeing such a row.
When thinking about column-level conflict resolution, it can be useful to see tables as vertically partitioned, so that each update affects data in only one slice. This approach eliminates conflicts between changes to different subsets of columns. In fact, vertical partitioning can even be a practical alternative to column-level conflict resolution.
Column-level conflict resolution requires the table to have
REPLICA IDENTITY FULL. The
bdr.alter_table_conflict_detection function does check
that and fails with an error otherwise.
The column-level conflict resolution is managed by the bdr.alter_table_conflict_detection() function.
To see how the
bdr.alter_table_conflict_detection() is used, consider
this example that creates a trivial table
test_table and then enables
column-level conflict resolution on it:
The function adds a new
cts column (as specified in
the function call), but it also created two triggers (
BEFORE UPDATE) that are responsible for maintaining timestamps
in the new column before each change.
Also, the new column specifies
with a default value, which means that
ALTER TABLE ... ADD COLUMN
doesn't perform a table rewrite.
We discourage using columns with the
bdr.column_timestamps data type
for other purposes as it can have negative effects.
For example, it switches the table to column-level conflict resolution, which doesn't
work correctly without the triggers.
You can list tables having column-level conflict resolution enabled
with the following query. This query detects the presence of a column of
This function creates column-level conflict resolution. It's called within
p_source— The two options are
p_timestamp— Timestamp depends on the source chosen. If
When enabling or disabling column timestamps on a table, the code uses DDL locking to ensure that there are no pending changes from before the switch. This approach ensures we see only conflicts with timestamps in both tuples or in neither of them. Otherwise, the code might unexpectedly see timestamps in the local tuple and NULL in the remote one. It also ensures that the changes are resolved the same way (column-level or row-level) on all nodes.
An important decision is the timestamp to assign to modified columns.
By default, the timestamp assigned to modified columns is the current
timestamp, as if obtained from
clock_timestamp. This is simple, and
for many cases it is perfectly correct (for example, when the conflicting rows
modify non-overlapping subsets of columns).
It can, however, have various unexpected effects:
The timestamp changes during statement execution, so if an
UPDATEaffects multiple rows, each gets a slightly different timestamp. This means that the effects of concurrent changes might get "mixed" in various ways (depending on how exactly the changes performed on different nodes interleave).
The timestamp is unrelated to the commit timestamp, and using it to resolve conflicts means that the result isn't equivalent to the commit order, which means it likely can't be serialized.
We might add statement and transaction timestamps in the future, which would address issues with mixing effects of concurrent statements or transactions. Still, neither of these options can ever produce results equivalent to commit order.
It's possible to also use the actual commit timestamp, although this
feature is currently considered experimental. To use the commit timestamp,
set the last parameter to
true when enabling column-level conflict
You can disable it using
Commit timestamps currently have restrictions that are explained in Notes.
The column storing timestamps for modified columns is maintained automatically by triggers. Don't modify it directly. It can be useful to inspect the current timestamps value, for example, while investigating how a conflict was resolved.
Three functions are useful for this purpose:
This function returns a human-readable representation of the timestamp mapping and is used when casting the value to
This function turns a JSONB representation of the timestamps mapping and is used when casting the value to
This function updates the mapping with the commit timestamp for the attributes modified by the most recent transaction (if it already committed). This matters only when using the commit timestamp. For example, in this case, the last transaction updated the second attribute (with
attnum = 2):
By default, column-level conflict resolution picks the value with a higher timestamp and discards the other one. You can, however, reconcile the conflict in different, more elaborate ways. For example, you can use CRDT types that allow merging the conflicting values without discarding any information.
The attributes modified by an
UPDATEare determined by comparing the old and new row in a trigger. This means that if the attribute doesn't change a value, it isn't detected as modified even if it's explicitly set. For example,
UPDATE t SET a = adoesn't mark
aas modified for any row. Similarly,
UPDATE t SET a = 1doesn't mark
aas modified for rows that are already set to
INSERTstatements, we don't have any old row to compare the new one to, so we consider all attributes to be modified and assign them a new timestamp. This applies even for columns that weren't included in the
INSERTstatement and received default values. We can detect which attributes have a default value but can't know if it was included automatically or specified explicitly.
This effectively means column-level conflict resolution doesn't work for
INSERT-INSERTconflicts even if the
INSERTstatements specify different subsets of columns. The newer row has timestamps that are all newer than the older row.
By treating the columns independently, it's easy to violate constraints in a way that isn't possible when all changes happen on the same node. Consider, for example, a table like this:
Assume one node does:
Another node concurrently does:
Each of those updates is valid when executed on the initial row and
so passes on each node. But when replicating to the other node,
the resulting row violates the
CHECK (A > b) constraint, and the
replication stops until the issue is resolved manually.
The column storing timestamp mapping is managed automatically. Don't specify or override the value in your queries, as it can result in unpredictable effects. (We do ignore the value where possible anyway.)
The timestamp mapping is maintained by triggers, but the order in which triggers execute matters. So if you have custom triggers that modify tuples and are executed after the
pgl_clcd_triggers, the modified columns aren't detected correctly.
When using regular timestamps to order changes/commits, it's possible that the conflicting changes have exactly the same timestamp (because two or more nodes happened to generate the same timestamp). This risk isn't unique to column-level conflict resolution, as it can happen even for regular row-level conflict resolution. We use node id as a tie-breaker in this situation (the higher node id wins), which ensures that the same changes are applied on all nodes.
It is possible that there is a clock skew between different nodes. While it can induce somewhat unexpected behavior (discarding seemingly newer changes because the timestamps are inverted), you can manage clock skew between nodes using the parameters