Unicode collation algorithm v14

The Unicode Collation Algorithm (UCA) is a specification (Unicode Technical Report #10) that defines a customizable method of collating and comparing Unicode data. Collation means how data is sorted as with a SELECT … ORDER BY clause. Comparison is relevant for searches that use ranges with less than, greater than, or equal to operators.

Customizability is an important factor for various reasons such as the following.

  • Unicode supports a vast number of languages. Letters that may be common to several languages may be expected to collate in different orders depending upon the language.
  • Characters that appear with letters in certain languages such as accents or umlauts have an impact on the expected collation depending upon the language.
  • In some languages, combinations of several consecutive characters should be treated as a single character with regards to its collation sequence.
  • There may be certain preferences as to the collation of letters according to case. For example, should the lowercase form of a letter collate before the uppercase form of the same letter or vice versa.
  • There may be preferences as to whether punctuation marks such as underscore characters, hyphens, or space characters should be considered in the collating sequence or should they simply be ignored as if they did not exist in the string.

Given all of these variations with the vast number of languages supported by Unicode, there is a necessity for a method to select the specific criteria for determining a collating sequence. This is what the Unicode Collation Algorithm defines.

Note

In addition, another advantage for using ICU collations (the implementation of the Unicode Collation Algorithm) is for performance. Sorting tasks, including B-tree index creation, can complete in less than half the time it takes with a non-ICU collation. The exact performance gain depends on your operating system version, the language of your text data, and other factors.

The following sections provide a brief, simplified explanation of the Unified Collation Algorithm concepts. As the algorithm and its usage are quite complex with numerous variations, refer to the official documents cited in these sections for complete details.

Basic Unicode collation algorithm concepts

The official information for the Unicode Collation Algorithm is specified in Unicode Technical Report #10, which can be found on The Unicode Consortium website at:

http://www.unicode.org/reports/tr10/

The ICU – International Components for Unicode also provides much useful information. An explanation of the collation concepts can be found on their website located at:

http://userguide.icu-project.org/collation/concepts

The basic concept behind the Unicode Collation Algorithm is the use of multilevel comparison. This means that a number of levels are defined, which are listed as level 1 through level 5 in the following bullet points. Each level defines a type of comparison. Strings are first compared using the primary level, also called level 1.

If the order can be determined based on the primary level, then the algorithm is done. If the order cannot be determined based on the primary level, then the secondary level, level 2, is applied. If the order can be determined based on the secondary level, then the algorithm is done, otherwise the tertiary level is applied, and so on. There is typically, a final tie-breaking level to determine the order if it cannot be resolved by the prior levels.

  • Level 1 – Primary Level for Base Characters. The order of basic characters such as letters and digits determines the difference such as A < B.
  • Level 2 – Secondary Level for Accents. If there are no primary level differences, then the presence or absence of accents and other such characters determine the order such as a < á.
  • Level 3 – Tertiary Level for Case. If there are no primary level or secondary level differences, then a difference in case determines the order such as a < A.
  • Level 4 – Quaternary Level for Punctuation. If there are no primary, secondary, or tertiary level differences, then the presence or absence of white space characters, control characters, and punctuation determine the order such as -A < A.
  • Level 5 – Identical Level for Tie-Breaking. If there are no primary, secondary, tertiary, or quaternary level differences, then some other difference such as the code point values determines the order.

International components for Unicode

The Unicode Collation Algorithm is implemented by open source software provided by the International Components for Unicode (ICU). The software is a set of C/C++ and Java libraries.

When EDB Postgres Advanced Server is used to create a collation that invokes the ICU components to produce the collation, the result is referred to as an ICU collation.

Locale collations

When creating a collation for a locale, a predefined ICU short form name for the given locale is typically provided.

An ICU short form is a method of specifying collation attributes, which are the properties of a collation. Collation Attributes provides additional information on collation attributes.

There are predefined ICU short forms for locales. The ICU short form for a locale incorporates the collation attribute settings typically used for the given locale. This simplifies the collation creation process by eliminating the need to specify the entire list of collation attributes for that locale.

The system catalog pg_catalog.pg_icu_collate_names contains a list of the names of the ICU short forms for locales. The ICU short form name is listed in column icu_short_form.

edb=# SELECT icu_short_form, valid_locale FROM pg_icu_collate_names ORDER BY
valid_locale;
 icu_short_form | valid_locale
----------------+--------------
 LAF            | af
 LAR            | ar
 LAS            | as
 LAZ            | az
 LBE            | be
 LBG            | bg
 LBN            | bn
 LBS            | bs
 LBS_ZCYRL      | bs_Cyrl
 LROOT          | ca
 LROOT          | chr
 LCS            | cs
 LCY            | cy
 LDA            | da
 LROOT          | de
 LROOT          | dz
 LEE            | ee
 LEL            | el
 LROOT          | en
 LROOT          | en_US
 LEN_RUS_VPOSIX | en_US_POSIX
 LEO            | eo
 LES            | es
 LET            | et
 LFA            | fa
 LFA_RAF        | fa_AF
    .
    .
    .

If needed, the default characteristics of an ICU short form for a given locale can be overridden by specifying the collation attributes to override that property. This is discussed in the next section.

Collation attributes

When creating an ICU collation, the desired characteristics of the collation must be specified. As discussed in Locale Collations, this can typically be done with an ICU short form for the desired locale. However, if more specific information is required, the specification of the collation properties can be done by using collation attributes.

Collation attributes define the rules of how characters are to be compared for determining the collation sequence of text strings. As Unicode covers a vast set of languages in numerous variations according to country, territory and culture, these collation attributes are quite complex.

For the complete, precise meaning and usage of collation attributes, see Section “Collator Naming Scheme” on the ICU – International Components for Unicode website at:

http://userguide.icu-project.org/collation/concepts

The following is a brief summary of the collation attributes and how they are specified using the ICU short form method

Each collation attribute is represented by an uppercase letter, which are listed in the following bullet points. The possible valid values for each attribute are given by codes shown within the parentheses. Some codes have general meanings for all attributes. X means to set the attribute off. O means to set the attribute on. D means to set the attribute to its default value.

  • A – Alternate (N, S, D). Handles treatment of variable characters such as white spaces, punctuation marks, and symbols. When set to non-ignorable (N), differences in variable characters are treated with the same importance as differences in letters. When set to shifted (S), then differences in variable characters are of minor importance (that is, the variable character is ignored when comparing base characters).
  • C – Case first (X, L, U, D). Controls whether a lowercase letter sorts before the same uppercase letter (L), or the uppercase letter sorts before the same lowercase letter (U). Off (X) is typically specified when lowercase first (L) is desired.
  • E – Case level (X, O, D). Set in combination with the Strength attribute, the Case Level attribute is used when accents are to be ignored, but not case.
  • F – French collation (X, O, D). When set to on, secondary differences (presence of accents) are sorted from the back of the string as done in the French Canadian locale.
  • H – Hiragana quaternary (X, O, D). Introduces an additional level to distinguish between the Hiragana and Katakana characters for compatibility with the JIS X 4061 collation of Japanese character strings.
  • N – Normalization checking (X, O, D). Controls whether or not text is thoroughly normalized for comparison. Normalization deals with the issue of canonical equivalence of text whereby different code point sequences represent the same character, which then present issues when sorting or comparing such characters. Languages such as Arabic, ancient Greek, Hebrew, Hindi, Thai, or Vietnamese should be used with Normalization Checking set to on.
  • S – Strength (1, 2, 3, 4, I, D). Maximum collation level used for comparison. Influences whether accents or case are taken into account when collating or comparing strings. Each number represents a level. A setting of I represents identical strength (that is, level 5).
  • T – Variable top (hexadecimal digits). Applicable only when the Alternate attribute is not set to non-ignorable (N). The hexadecimal digits specify the highest character sequence that is to be considered ignorable. For example, if white space is to be ignorable, but visible variable characters are not to be ignorable, then Variable Top set to 0020 would be specified along with the Alternate attribute set to S and the Strength attribute set to 3. (The space character is hexadecimal 0020. Other non-visible variable characters such as backspace, tab, line feed, carriage return, etc. have values less than 0020. All visible punctuation marks have values greater than 0020.)

A set of collation attributes and their values is represented by a text string consisting of the collation attribute letter concatenated with the desired attribute value. Each attribute/value pair is joined to the next pair with an underscore character as shown by the following example.

AN_CX_EX_FX_HX_NO_S3

Collation attributes can be specified along with a locale’s ICU short form name to override those default attribute settings of the locale.

The following is an example where the ICU short form named LROOT is modified with a number of other collation attribute/value pairs.

AN_CX_EX_LROOT_NO_S3

In the preceding example, the Alternate attribute (A) is set to non-ignorable (N). The Case First attribute (C) is set to off (X). The Case Level attribute (E) is set to off (X). The Normalization attribute (N) is set to on (O). The Strength attribute (S) is set to the tertiary level 3. LROOT is the ICU short form to which these other attributes are applying modifications.

Using a collation

A newly defined ICU collation can be used anywhere the COLLATION "collation_name" clause can be used in a SQL command such as in the column specifications of the CREATE TABLE command or appended to an expression in the ORDER BY clause of a SELECT command.

The following are some examples of the creation and usage of ICU collations based on the English language in the United States (en_US.UTF8).

In these examples, ICU collations are created with the following characteristics.

Collation icu_collate_lowercase forces the lowercase form of a letter to sort before its uppercase counterpart (CL).

Collation icu_collate_uppercase forces the uppercase form of a letter to sort before its lowercase counterpart (CU).

Collation icu_collate_ignore_punct causes variable characters (white space and punctuation marks) to be ignored during sorting (AS).

Collation icu_collate_ignore_white_sp causes white space and other non-visible variable characters to be ignored during sorting, but visible variable characters (punctuation marks) are not ignored (AS, T0020).

CREATE COLLATION icu_collate_lowercase (
    LOCALE = 'en_US.UTF8',
    ICU_SHORT_FORM = 'AN_CL_EX_NX_LROOT'
);

CREATE COLLATION icu_collate_uppercase (
    LOCALE = 'en_US.UTF8',
    ICU_SHORT_FORM = 'AN_CU_EX_NX_LROOT'
);

CREATE COLLATION icu_collate_ignore_punct (
    LOCALE = 'en_US.UTF8',
    ICU_SHORT_FORM = 'AS_CX_EX_NX_LROOT_L3'
);

CREATE COLLATION icu_collate_ignore_white_sp (
    LOCALE = 'en_US.UTF8',
    ICU_SHORT_FORM = 'AS_CX_EX_NX_LROOT_L3_T0020'
);
Note

When creating collations, ICU may generate notice and warning messages when attributes are given to modify the LROOT collation.

The following psql command lists the collations.

edb=# \dO
                                      List of collations
    Schema    |            Name             |  Collate   |   Ctype    |     ICU
--------------+-----------------------------+------------+------------+----------------------------
 enterprisedb | icu_collate_ignore_punct    | en_US.UTF8 | en_US.UTF8 | AS_CX_EX_NX_LROOT_L3
 enterprisedb | icu_collate_ignore_white_sp | en_US.UTF8 | en_US.UTF8 |
 AS_CX_EX_NX_LROOT_L3_T0020
 enterprisedb | icu_collate_lowercase       | en_US.UTF8 | en_US.UTF8 | AN_CL_EX_NX_LROOT
 enterprisedb | icu_collate_uppercase       | en_US.UTF8 | en_US.UTF8 | AN_CU_EX_NX_LROOT
(4 rows)

The following table is created and populated.

CREATE TABLE collate_tbl (
    id              INTEGER,
    c2              VARCHAR(2)
);

INSERT INTO collate_tbl VALUES (1, 'A');
INSERT INTO collate_tbl VALUES (2, 'B');
INSERT INTO collate_tbl VALUES (3, 'C');
INSERT INTO collate_tbl VALUES (4, 'a');
INSERT INTO collate_tbl VALUES (5, 'b');
INSERT INTO collate_tbl VALUES (6, 'c');
INSERT INTO collate_tbl VALUES (7, '1');
INSERT INTO collate_tbl VALUES (8, '2');
INSERT INTO collate_tbl VALUES (9, '.B');
INSERT INTO collate_tbl VALUES (10, '-B');
INSERT INTO collate_tbl VALUES (11, ' B');

The following query sorts on column c2 using the default collation. Note that variable characters (white space and punctuation marks) with id column values of 9, 10, and 11 are ignored and sort with the letter B.

edb=# SELECT * FROM collate_tbl ORDER BY c2;
 id  | c2
-----+----
 7   | 1
 8   | 2
 4   | a
 1   | A
 5   | b
 2   | B
 11  | B
 10  | -B
 9   | .B
 6   | c
 3   | C
(11 rows)

The following query sorts on column c2 using collation icu_collate_lowercase, which forces the lowercase form of a letter to sort before the uppercase form of the same base letter. Also note that the AN attribute forces variable characters to be included in the sort order at the same level when comparing base characters so rows with id values of 9, 10, and 11 appear at the beginning of the sort list before all letters and numbers.

edb=# SELECT * FROM collate_tbl ORDER BY c2 COLLATE "icu_collate_lowercase";
 id | c2
----+----
 11 |  B
 10 | -B
 9  | .B
 7  | 1
 8  | 2
 4  | a
 1  | A
 5  | b
 2  | B
 6  | c
 3  | C
(11 rows)

The following query sorts on column c2 using collation icu_collate_uppercase, which forces the uppercase form of a letter to sort before the lowercase form of the same base letter.

edb=# SELECT * FROM collate_tbl ORDER BY c2 COLLATE "icu_collate_uppercase";
 id | c2
----+----
 11 |  B
 10 | -B
 9  | .B
 7  | 1
 8  | 2
 1  | A
 4  | a
 2  | B
 5  | b
 3  | C
 6  | c
(11 rows)

The following query sorts on column c2 using collation icu_collate_ignore_punct, which causes variable characters to be ignored so rows with id values of 9, 10, and 11 sort with the letter B as that is the character immediately following the ignored variable character.

edb=# SELECT * FROM collate_tbl ORDER BY c2 COLLATE
"icu_collate_ignore_punct";
 id  | c2
-----+----
  7  | 1
  8  | 2
  4  | a
  1  | A
  5  | b
  11 |  B
  2  | B
  9  | .B
  10 | -B
  6  | c
  3  | C
(11 rows)

The following query sorts on column c2 using collation icu_collate_ignore_white_sp. The AS and T0020 attributes of the collation cause variable characters with code points less than or equal to hexadecimal 0020 to be ignored while variable characters with code points greater than hexadecimal 0020 are included in the sort.

The row with id value of 11, which starts with a space character (hexadecimal 0020) sorts with the letter B. The rows with id values of 9 and 10, which start with visible punctuation marks greater than hexadecimal 0020, appear at the beginning of the sort list as these particular variable characters are included in the sort order at the same level when comparing base characters.

edb=# SELECT * FROM collate_tbl ORDER BY c2 COLLATE
"icu_collate_ignore_white_sp";
 id | c2
----+----
 10 | -B
 9  | .B
 7  | 1
 8  | 2
 4  | a
 1  | A
 5  | b
 11 |  B
 2  | B
 6  | c
 3  | C
(11 rows)