Sequences v3.7

Many applications require that unique surrogate ids be assigned to database entries. Often the database SEQUENCE object is used to produce these. In PostgreSQL, these can be either a manually created sequence, using the CREATE SEQUENCE command and retrieved by calling the nextval() function, or serial and bigserial columns, or alternatively GENERATED BY DEFAULT AS IDENTITY columns.

However, standard sequences in PostgreSQL are not multi-node aware, and only produce values that are unique on the local node. This is important because unique ids generated by such sequences will cause conflicts and data loss (by means of discarded INSERTs) in multi-master replication.

BDR Global Sequences

For this reason, BDR provides an application-transparent way to generate unique ids across the whole BDR group, using sequences on bigint or bigserial datatypes; these are called global sequences.

BDR global sequences provide an easy way for applications to use the database to generate unique synthetic keys in an asynchronous distributed system that works for most - but not necessarily all - cases.

Using BDR global sequences allows you to avoid the problems with insert conflicts. If you define a PRIMARY KEY or UNIQUE constraint on a column which is using a global sequence, it is not possible for any node to ever get the same value as any other node. When BDR synchronizes inserts between the nodes, they can never conflict.

BDR global sequences extend PostgreSQL sequences, so are crash-safe. To use them, you must have been granted the bdr_application role.

There are various possible algorithms for global sequences:

  • Timeshard sequences
  • Globally-allocated range sequences

Timeshard sequences generate values using an algorithm that does not require inter-node communication at any point, so it is faster and more robust, and has the useful property of recording the timestamp at which the values were created. Timeshard sequences have the restriction that they work only with 64-bit BIGINT datatypes and produce values 19 digits long, which may be too long for some host-language datatypes, such as JavaScript Integer types. Globally-allocated sequences allocate a local range of values that can be replenished as needed by inter-node consensus, making them suitable for either BIGINT or INTEGER sequences.

A global sequence can be created using the bdr.alter_sequence_set_kind() function. This function takes a standard PostgreSQL sequence and marks it as a BDR global sequence. It can also convert the sequence back to the standard PostgreSQL sequence (see below).

BDR also provides the configuration variable bdr.default_sequence_kind, which determines what kind of sequence will be created when the CREATE SEQUENCE command is executed or when a serial, bigserial or GENERATED BY DEFAULT AS IDENTITY column is created. Valid settings are:

  • local (the default) meaning that newly created sequences are the standard PostgreSQL (local) sequences.
  • galloc which always creates globally-allocated range sequences.
  • timeshard which creates time-sharded global sequences for BIGINT sequences, but will throw ERRORs when used with INTEGER sequences.

The bdr.sequences view shows information about individual sequence kinds.

currval() and lastval() work correctly for all types of global sequence.
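
For example, a minimal sketch (the table name is illustrative) that sets the default sequence kind before creating a bigserial column, then inspects the result via the bdr.sequences view:

-- Sequences created from now on in this session will be timeshard global sequences
SET bdr.default_sequence_kind = 'timeshard';

CREATE TABLE documents (
    id    bigserial PRIMARY KEY,   -- backing sequence becomes a timeshard global sequence
    title text
);

-- The bdr.sequences view reports the kind of each sequence
SELECT * FROM bdr.sequences;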

Timeshard Sequences

The ids generated by timeshard sequences are loosely time-ordered so they can be used to get the approximate order of data insertion, like standard PostgreSQL sequences. Values generated within the same millisecond might be out of order, even on one node. The property of loose time-ordering means they are suitable for use as range partition keys.

Timeshard sequences work on one or more nodes, and do not require any inter-node communication after the node join process completes. So they may continue to be used even if there is a risk of extended network partitions, and are not affected by replication lag or inter-node latency.

Timeshard sequences generate unique ids in a different way to standard sequences. The algorithm uses 3 components for a sequence number. The first component of the sequence is a timestamp at the time of sequence number generation. The second component of the sequence number is the unique id assigned to each BDR node, which ensures that the ids from different nodes will always be different. Finally, the third component is the number generated by the local sequence itself.

While adding a unique node id to the sequence number would be enough to ensure there are no conflicts, we also want to keep another useful property of sequences, which is that the ordering of the sequence numbers roughly corresponds to the order in which data was inserted into the table. Putting the timestamp first ensures this.
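
The three components can be inspected with the extraction functions documented under [Global Sequence Management Interfaces] below. For example, assuming a hypothetical table foo whose id column is fed by a timeshard sequence:

-- Decompose timeshard values into timestamp, node id and local sequence number
SELECT id,
       bdr.extract_timestamp_from_timeshard(id)  AS generated_at,
       bdr.extract_nodeid_from_timeshard(id)     AS origin_node,
       bdr.extract_localseqid_from_timeshard(id) AS local_seq_value
  FROM foo
 LIMIT 3;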

A few limitations and caveats apply to timeshard sequences.

Timeshard sequences are 64 bits wide and need a bigint or bigserial. Values generated will be at least 19 digits long. There is no practical 32-bit integer version, so they cannot be used with serial sequences; use globally-allocated range sequences instead.

There is a limit of 8192 sequence values generated per millisecond on any given node for any given sequence. If more than 8192 sequence values per millisecond are generated from one sequence on one node, the generated values will wrap around and could collide. There is no check on this, for performance reasons; the value is not reset to 0 at the start of each millisecond. A collision will usually result in a UNIQUE constraint violation on INSERT or UPDATE. It cannot cause a replication conflict, because sequence values generated on different nodes can never collide, since they contain the nodeid.

In practice this is harmless; values are not generated fast enough to trigger this limitation as there will be other work being done, rows inserted, indexes updated, etc. Despite that, applications should have a UNIQUE constraint in place where they absolutely rely on a lack of collisions.

Perhaps more importantly, the timestamp component will run out of values in the year 2050, and, if used in combination with bigint, the values will wrap to negative numbers in the year 2033. This means that sequence values generated after 2033 will be negative. If you plan to deploy your application beyond this date, try one of the [UUIDs, KSUUIDs and Other Approaches] mentioned below, or use globally-allocated range sequences instead.

The INCREMENT option on a sequence used as input for timeshard sequences is effectively ignored. This could be relevant for applications that do sequence ID caching, like many object-relational mapper (ORM) tools, notably Hibernate. Because the sequence is time-based, this has little practical effect since the sequence will have advanced to a new non-colliding value by the time the application can do anything with the cached values.

Similarly, the START, MINVALUE, MAXVALUE and CACHE settings may be changed on the underlying sequence, but there is no benefit to doing so. The sequence's low 14 bits are used and the rest is discarded, so the value range limits do not affect the function's result. For the same reason, setval() is not useful for timeshard sequences.

Globally-Allocated Range Sequences

The globally-allocated range (or galloc) sequences allocate ranges (chunks) of values to each node. When the local range is used up, a new range is allocated globally by consensus amongst the other nodes. This uses the key space efficiently, but requires that the local node be connected to a majority of the nodes in the cluster for the sequence generator to progress once the currently assigned local range has been used up.

Unlike timeshard sequences, galloc sequences support all sequence data types provided by PostgreSQL: smallint, integer and bigint. This means that galloc sequences can be used in environments where 64-bit sequences are problematic, such as using integers in JavaScript, since it supports only 53-bit values, or when the sequence is displayed on output with limited space.

The range assigned by each allocation vote is currently predetermined based on the datatype the sequence is using:

  • smallint - 1 000 numbers
  • integer - 1 000 000 numbers
  • bigint - 1 000 000 000 numbers

Each node will allocate two chunks of seq_chunk_size, one for the current use plus a reserved chunk for future usage, so the values generated from any one node will increase monotonically. However, viewed globally, the values generated will not be ordered at all. This could cause a loss of performance due to the effects on b-tree indexes, and will typically mean that generated values will not be useful as range partition keys.

The main downside of galloc sequences is that once the assigned range is used up, the sequence generator has to ask for consensus about the next range for the local node, which requires inter-node communication. This could lead to delays or operational issues if the majority of the BDR group is not accessible. This may be avoided in later releases.

The CACHE, START, MINVALUE and MAXVALUE options work correctly with galloc sequences; however, you need to set them before transforming the sequence to the galloc kind. The INCREMENT BY option also works correctly; however, you cannot assign an increment value that is equal to or greater than the above ranges assigned for each sequence datatype. setval() does not reset the global state for galloc sequences and should not be used.
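
For example, a minimal sketch (the sequence name is hypothetical) that creates an integer sequence, sets its options while it is still a local sequence, and only then converts it:

-- Options must be in place before the conversion to galloc
CREATE SEQUENCE public.short_id_seq AS integer MAXVALUE 2000000000;

SELECT bdr.alter_sequence_set_kind('public.short_id_seq'::regclass, 'galloc');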

A few limitations apply to galloc sequences. BDR tracks galloc sequences in a special BDR catalog, bdr.sequence_alloc. This catalog is required to track the currently allocated chunks for the galloc sequences. The sequence name and namespace are stored in this catalog. Since sequence chunk allocation is managed via Raft, whereas any changes to the sequence name or namespace are managed via the replication stream, BDR currently does not support renaming galloc sequences, moving them to another namespace, or renaming the namespace that contains a galloc sequence. The user should be mindful of this limitation while designing the application schema.

Usage

Before transforming a local sequence to galloc, you need to take care of these prerequisites:

When the sequence kind is altered to galloc, the sequence will be rewritten and restarted from the defined start value of the local sequence. If this happens on an existing sequence in a production database, you will need to query the current value and then set the start value appropriately. To assist with this use case, BDR allows users to pass a starting value to the bdr.alter_sequence_set_kind() function. If you are already using offset-based sequences and have writes from multiple nodes, you need to check which is the greatest used value and restart the sequence to at least the next value.

-- determine highest sequence value across all nodes
SELECT max((x->'response'->0->>'nextval')::bigint)
    FROM jsonb_array_elements(
        bdr.run_on_all_nodes(
            E'SELECT nextval(\'public.sequence\');'
        )::jsonb) AS x;

-- turn into a galloc sequence
SELECT bdr.alter_sequence_set_kind('public.sequence'::regclass, 'galloc', $MAX + $MARGIN);

Since users cannot lock a sequence, you must leave a $MARGIN value to allow operations to continue while the max() value is queried.

The bdr.sequence_alloc table gives information on the chunk size and which ranges are allocated around the whole cluster. In this example, the sequence was started from 333 and the cluster has two nodes. The allocation count (seq_nallocs) is 4, that is, 2 per node, and the chunk size is 1000000, which corresponds to an integer sequence.

SELECT * FROM bdr.sequence_alloc
    WHERE seqid = 'public.categories_category_seq'::regclass;
          seqid          | seq_chunk_size | seq_allocated_up_to | seq_nallocs |       seq_last_alloc
-------------------------+----------------+---------------------+-------------+-----------------------------
 categories_category_seq |        1000000 |             4000333 |           4 | 2020-05-21 20:02:15.957835+00
(1 row)

To see the ranges currently assigned to a given sequence on each node, use these queries:

  • Node Node1 is using range from 333 to 2000333.
SELECT last_value AS range_start, log_cnt AS range_end
    FROM categories_category_seq WHERE ctid = '(0,2)'; -- first range
 range_start | range_end
-------------+-----------
         334 |   1000333
(1 row)

SELECT last_value AS range_start, log_cnt AS range_end
    FROM categories_category_seq WHERE ctid = '(0,3)'; -- second range
 range_start | range_end
-------------+-----------
     1000334 |   2000333
(1 row)
  • Node Node2 is using range from 2000334 to 4000333.
SELECT last_value AS range_start, log_cnt AS range_end
    FROM categories_category_seq WHERE ctid = '(0,2)'; -- first range
 range_start | range_end
-------------+-----------
     2000334 |   3000333
(1 row)

SELECT last_value AS range_start, log_cnt AS range_end
    FROM categories_category_seq WHERE ctid = '(0,3)'; -- second range
 range_start | range_end
-------------+-----------
     3000334 |   4000333
(1 row)

NOTE: You can't combine these into a single query (such as WHERE ctid IN ('(0,2)', '(0,3)')), as that will still only show the first range.

When a node finishes a chunk, it will ask for consensus on a new one and get the first available; in our case, that will be the range from 4000334 to 5000333. This becomes the new reserved chunk, and the node will start consuming the previously reserved chunk.

UUIDs, KSUUIDs and Other Approaches

There are other ways to generate globally unique ids without using the global sequences that can be used with BDR. For example:

  • UUIDs, and their BDR variant, KSUUIDs
  • Local sequences with a different offset per node (i.e. manual)
  • An externally co-ordinated natural key

Please note that BDR applications cannot use other methods safely: counter-table based approaches relying on SELECT ... FOR UPDATE, UPDATE ... RETURNING ... or similar for sequence generation will not work correctly in BDR, because BDR does not take row locks between nodes. The same values will be generated on more than one node. For the same reason, the usual strategies for "gapless" sequence generation do not work with BDR. In most cases the application should coordinate generation of sequences that must be gapless from some external source using two-phase commit, or it should only generate them on one node in the BDR group.
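
To make the failure mode concrete, here is a sketch of the counter-table pattern that must be avoided (the table name is hypothetical):

-- Unsafe on BDR: row locks are not taken between nodes, so two nodes can
-- both read last_value = 41 and both hand out 42, producing duplicates.
CREATE TABLE id_counter (last_value bigint NOT NULL);
INSERT INTO id_counter VALUES (0);

UPDATE id_counter
   SET last_value = last_value + 1
 RETURNING last_value;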

UUIDs and KSUUIDs

UUID keys instead avoid sequences entirely and use 128-bit universally unique identifiers. These are random or pseudorandom values that are so large that it is nearly impossible for the same value to be generated twice. There is no need for nodes to have continuous communication when using UUID keys.

In the incredibly unlikely event of a collision, conflict detection will choose the newer of the two inserted records to retain. Conflict logging, if enabled, will record such an event, but it is exceptionally unlikely to ever occur, since collisions only become practically likely after about 2^64 keys have been generated.

The main downside of UUID keys is that they are somewhat space- and network-inefficient, consuming more space not only as a primary key, but also wherever they are referenced in foreign keys and whenever they are transmitted on the wire. Additionally, not all applications cope well with UUID keys.

BDR provides functions for working with a K-Sortable variant of UUID data, known as KSUUID, which generates values that can be stored using PostgreSQL's standard UUID data type. A KSUUID value is similar to UUIDv1 in that it stores both timestamp and random data, following the UUID standard. The difference is that KSUUID is K-Sortable, meaning that it's weakly sortable by timestamp. This makes it more useful as a database key as it produces more compact btree indexes, which improves the effectiveness of search, and allows natural time-sorting of result data. Unlike UUIDv1, KSUUID values do not include the MAC of the computer on which they were generated, so there should be no security concerns from using KSUUIDs.

There are two versions of KSUUID in BDR, v1 and v2. The legacy KSUUID v1 is now deprecated, but is kept in order to support existing installations; it should not be used for new installations. The internal contents of v1 and v2 are not compatible, and as such the functions to manipulate them are also not compatible. The v2 of KSUUID also no longer stores the UUID version number.

KSUUID v2 is now recommended in all cases. Values generated are directly sortable with regular comparison operators.
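
For example, a hypothetical table using bdr.gen_ksuuid_v2() (documented below) as its key default; passing NULL makes it use the current system time:

CREATE TABLE ksuuid_keyed (
    id   uuid PRIMARY KEY DEFAULT bdr.gen_ksuuid_v2(NULL),
    data text
);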

Step & Offset Sequences

In offset-step sequences, a normal PostgreSQL sequence is used on each node. Each sequence increments by the same amount and starts at differing offsets. For example with step 1000, node1's sequence generates 1001, 2001, 3001, and so on, node2's generates 1002, 2002, 3002, etc. This scheme works well even if the nodes cannot communicate for extended periods, but the designer must specify a maximum number of nodes when establishing the schema, and it requires per-node configuration. However, mistakes can easily lead to overlapping sequences.

It is relatively simple to configure this approach with BDR by creating the desired sequence on one node, like this:

CREATE TABLE some_table (
    generated_value bigint primary key
);

CREATE SEQUENCE some_seq INCREMENT 1000 OWNED BY some_table.generated_value;

ALTER TABLE some_table ALTER COLUMN generated_value SET DEFAULT nextval('some_seq');

... then, on each node, call setval() to give each node a different offset starting value, e.g.:

-- On node 1
SELECT setval('some_seq', 1);

-- On node 2
SELECT setval('some_seq', 2);

 -- ... etc

You should be sure to allow a large enough INCREMENT to leave room for all the nodes you may ever want to add, since changing it in future is difficult and disruptive.

If you use bigint values, there is no practical concern about key exhaustion, even if you use offsets of 10000 or more. You'll need hundreds of years, with hundreds of machines, doing millions of inserts per second, to have any chance of approaching exhaustion.

BDR does not currently offer any automation for configuration of the per-node offsets on such step/offset sequences.

Composite Keys

A variant on step/offset sequences is to use a composite key composed of PRIMARY KEY (node_number, generated_value), where the node number is usually obtained from a function that returns a different number on each node. Such a function may be created by temporarily disabling DDL replication and creating a constant SQL function, or by using a one-row table that is not part of a replication set to store a different value in each node.
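
A sketch of this approach is shown below; the my_node_number() helper is hypothetical and would be created with DDL replication temporarily disabled, so that each node can define a different constant:

-- On each node, create a constant function returning that node's number
CREATE FUNCTION my_node_number() RETURNS integer
    LANGUAGE sql IMMUTABLE AS 'SELECT 1';    -- use 2, 3, ... on the other nodes

CREATE SEQUENCE composite_local_seq;          -- ordinary local sequence

CREATE TABLE composite_keyed (
    node_number     integer NOT NULL DEFAULT my_node_number(),
    generated_value bigint  NOT NULL DEFAULT nextval('composite_local_seq'),
    payload         text,
    PRIMARY KEY (node_number, generated_value)
);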

Global Sequence Management Interfaces

BDR provides an interface for converting between a standard PostgreSQL sequence and the BDR global sequence.

Note that the following functions are considered to be DDL, so DDL replication and global locking apply to them.

bdr.alter_sequence_set_kind

Allows the owner of a sequence to set the kind of a sequence. Once set, seqkind is only visible via the bdr.sequences view; in all other ways the sequence will appear as a normal sequence.

BDR treats this function as DDL, so DDL replication and global locking apply, if currently active. See [DDL Replication].

Synopsis

bdr.alter_sequence_set_kind(seqoid regclass, seqkind text)

Parameters

  • seqoid - name or Oid of the sequence to be altered
  • seqkind - local for a standard PostgreSQL sequence, timeshard for BDR global sequence which uses the "time and sharding" based algorithm described in the [BDR Global Sequences] section, or galloc for globally-allocated range sequences which use consensus between nodes to assign unique ranges of sequence numbers to each node

Notes

When changing the sequence kind to galloc, the first allocated range for that sequence will use the sequence start value as the starting point. When there are already existing values that were used by the sequence before it was changed to galloc, it is recommended to move the starting point using the following command, so that the newly generated values will not conflict with the existing ones:

ALTER SEQUENCE seq_name START starting_value RESTART

This function uses the same replication mechanism as DDL statements. This means that the replication is affected by the ddl filters configuration.

The function will take a global DDL lock. It will also lock the sequence locally.

This function is transactional - the effects can be rolled back with the ROLLBACK of the transaction, and the changes are visible to the current transaction.

The bdr.alter_sequence_set_kind function can only be executed by the owner of the sequence, unless bdr.backwards_compatibility is set to 30618 or below.
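
For example, converting a hypothetical sequence to a timeshard global sequence and later back to a standard local sequence:

SELECT bdr.alter_sequence_set_kind('public.orders_id_seq'::regclass, 'timeshard');

-- ... and back to a standard (local) PostgreSQL sequence
SELECT bdr.alter_sequence_set_kind('public.orders_id_seq'::regclass, 'local');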

bdr.extract_timestamp_from_timeshard

This function extracts the timestamp component of the timeshard sequence. The return value is of type "timestamptz".

Synopsis

bdr.extract_timestamp_from_timeshard(timeshard_seq bigint)

Parameters

  • timeshard_seq - value of a timeshard sequence

Notes

This function is only executed on the local node.

bdr.extract_nodeid_from_timeshard

This function extracts the nodeid component of the timeshard sequence.

Synopsis

bdr.extract_nodeid_from_timeshard(timeshard_seq bigint)

Parameters

  • timeshard_seq - value of a timeshard sequence

Notes

This function is only executed on the local node.

bdr.extract_localseqid_from_timeshard

This function extracts the local sequence value component of the timeshard sequence.

Synopsis

bdr.extract_localseqid_from_timeshard(timeshard_seq bigint)

Parameters

  • timeshard_seq - value of a timeshard sequence

Notes

This function is only executed on the local node.

bdr.timestamp_to_timeshard

This function converts a timestamp value to a dummy timeshard sequence value.

This is useful for doing indexed searches or comparisons between values in a timeshard column and a specific timestamp.

For example, given a table foo with a column id which is using a timeshard sequence, we can get the number of changes since yesterday midnight like this:

SELECT count(1) FROM foo WHERE id > bdr.timestamp_to_timeshard('yesterday')

A query formulated this way will use an index scan on the column id.

Synopsis

bdr.timestamp_to_timeshard(ts timestamptz)

Parameters

  • ts - timestamp to be used for the timeshard sequence generation

Notes

This function is only executed on the local node.

KSUUID v2 Functions

Functions for working with KSUUID v2 data, K-Sortable UUID data.

bdr.gen_ksuuid_v2

This function generates a new KSUUID v2 value, using the timestamp passed as an argument, or the current system time if NULL is passed.

The return value is of type "UUID".

Synopsis

bdr.gen_ksuuid_v2(timestamptz)

Notes

This function is only executed on the local node.

bdr.ksuuid_v2_cmp

This function compares two KSUUID v2 values.

It returns 1 if the first value is newer, -1 if the second value is newer, or zero if they are equal.

Synopsis

bdr.ksuuid_v2_cmp(uuid, uuid)

Parameters

  • UUID - KSUUID v2 to compare

Notes

This function is only executed on the local node.

bdr.extract_timestamp_from_ksuuid_v2

This function extracts the timestamp component of KSUUID v2. The return value is of type "timestamptz".

Synopsis

bdr.extract_timestamp_from_ksuuid_v2(uuid)

Parameters

  • UUID - KSUUID v2 value to extract timestamp from

Notes

This function is only executed on the local node.

KSUUID v1 Functions

Functions for working with KSUUID v1 data, K-Sortable UUID data (v1).

bdr.gen_ksuuid

This function generates a new KSUUID v1 value, using the current system time. The return value is of type "UUID".

Synopsis

bdr.gen_ksuuid()

Notes

This function is only executed on the local node.

bdr.uuid_v1_cmp

This function compares two KSUUID v1 values.

It returns 1 if the first value is newer, -1 if the second value is newer, or zero if they are equal.

Synopsis

bdr.uuid_v1_cmp(uuid, uuid)

Parameters

  • UUID - KSUUID v1 to compare

Notes

This function is only executed on the local node.

bdr.extract_timestamp_from_ksuuid

This function extracts the timestamp component of KSUUID v1 or UUIDv1 values. The return value is of type "timestamptz".

Synopsis

bdr.extract_timestamp_from_ksuuid(uuid)

Parameters

  • UUID - KSUUID v1 value to extract timestamp from

Notes

This function is only executed on the local node.