2.3.2 International Components for Unicode


2 Database Administration : 2.3 Unicode Collation Algorithm : 2.3.2 International Components for Unicode

The Unicode Collation Algorithm is implemented by open source software provided by the International Components for Unicode (ICU). The software is a set of C/C++ and Java libraries.
An ICU short form is a method of specifying collation attributes, which are the properties of a collation. Section 2.3.2.2 provides additional information on collation attributes.
The system catalog pg_catalog.pg_icu_collate_names contains a list of the names of the ICU short forms for locales. The ICU short form name is listed in column icu_short_form.
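When creating an ICU collation, the desired characteristics of the collation must be specified. As discussed in Section 2.3.2.1, this can typically be done with an ICU short form for the desired locale. However, if more specific information is required, the collation properties can be specified with collation attributes. Each attribute described here is identified by a single letter, followed by its name and, in parentheses, its permitted values. Among the values, X means off, O means on, and D means the default setting.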
A – Alternate (N, S, D). Controls the treatment of variable characters such as white space, punctuation marks, and symbols. When set to non-ignorable (N), differences in variable characters are treated with the same importance as differences in letters. When set to shifted (S), differences in variable characters are of minor importance (that is, variable characters are ignored when comparing base characters).
C – Case First (X, L, U, D). Controls whether a lowercase letter sorts before the same uppercase letter (L), or the uppercase letter sorts before the same lowercase letter (U). Off (X) is typically specified when lowercase first (L) is desired.
E – Case Level (X, O, D). Used in combination with the Strength attribute when accents are to be ignored but case is not. For that behavior, set Case Level to on (O) and Strength to 1.
F – French Collation (X, O, D). When set to on (O), secondary differences (the presence of accents) are compared starting from the end of the string, as is done in the French Canadian locale.
H – Hiragana Quaternary (X, O, D). Introduces an additional level to distinguish between the Hiragana and Katakana characters for compatibility with the JIS X 4061 collation of Japanese character strings.
N – Normalization Checking (X, O, D). Controls whether text is thoroughly normalized before comparison. Normalization addresses canonical equivalence, in which different code point sequences represent the same character and can therefore cause incorrect results when such characters are sorted or compared. Text in languages such as Arabic, ancient Greek, Hebrew, Hindi, Thai, or Vietnamese should be collated with Normalization Checking set to on (O).
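S – Strength (1, 2, 3, 4, I, D). Sets the maximum collation level used for comparison, which determines whether accents or case are taken into account when collating or comparing strings. Each number represents a level: 1 (primary) compares base characters only, 2 (secondary) also compares accents, 3 (tertiary) also compares case, and 4 (quaternary) adds a fourth level. A setting of I represents identical strength (that is, level 5).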
T – Variable Top (hexadecimal digits). Applicable only when the Alternate attribute is not set to non-ignorable (N). The hexadecimal digits specify the highest character sequence that is to be considered ignorable. For example, to make white space ignorable while keeping visible variable characters significant, set Variable Top to 0020, the Alternate attribute to S, and the Strength attribute to 3. (The space character is hexadecimal 0020. Other non-visible variable characters such as backspace, tab, line feed, and carriage return have values less than 0020. All visible punctuation marks have values greater than 0020.)
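The effect of the Strength, Case Level, and Normalization Checking attributes can be approximated outside the database. The following Python sketch is an illustration only: it does not call ICU, it tests equality rather than sort order, and the helper names strip_accents and comparison_key are hypothetical. It assumes that removing combining marks stands in for ignoring accents and that case folding stands in for ignoring case.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def comparison_key(text: str, strength: int, case_level: bool = False) -> str:
    """Rough stand-in for a collation key (hypothetical helper, not ICU).

    strength 1: base characters only
    strength 2: base characters and accents
    strength 3: base characters, accents, and case
    case_level=True keeps case differences even below strength 3.
    """
    if strength not in (1, 2, 3):
        raise ValueError("this sketch only models strengths 1 to 3")
    # Normalizing first makes canonically equivalent sequences compare equal.
    key = unicodedata.normalize("NFD", text)
    if strength == 1:
        key = strip_accents(key)
    if strength < 3 and not case_level:
        key = key.casefold()
    return key


# Strength 1: accents and case are both ignored.
print(comparison_key("Résumé", 1) == comparison_key("resume", 1))  # True
# Strength 2: accents matter, case does not.
print(comparison_key("Résumé", 2) == comparison_key("résumé", 2))  # True
print(comparison_key("résumé", 2) == comparison_key("resume", 2))  # False
# Strength 1 with Case Level on: accents ignored, case kept.
print(comparison_key("Résumé", 1, True) == comparison_key("Resume", 1, True))  # True
print(comparison_key("Résumé", 1, True) == comparison_key("résumé", 1, True))  # False
# Canonical equivalence: precomposed and decomposed forms match.
print(comparison_key("\u00e9", 3) == comparison_key("e\u0301", 3))  # True
```

A real collator also orders strings and applies locale-specific rules, so this sketch is only useful for building intuition about which differences each level takes into account.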
The following is an example where the ICU short form named LROOT is modified with a number of other collation attribute/value pairs:
AN_CX_EX_LROOT_NO_S3
In the preceding example, the Alternate attribute (A) is set to non-ignorable (N). The Case First attribute (C) is set to off (X). The Case Level attribute (E) is set to off (X). The Normalization Checking attribute (N) is set to on (O). The Strength attribute (S) is set to the tertiary level (3). LROOT is the ICU short form that these other attributes modify.
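A string of attribute/value pairs such as AN_CX_EX_LROOT_NO_S3, which matches the settings just described, can be decoded mechanically. The Python sketch below is a reading aid, not part of the database or of ICU: the function name parse_short_form is hypothetical, the attribute table is transcribed from this section, and two simplifications are assumed. Tokens are taken to be separated by underscores, and any token beginning with L is taken to be the base short form name. Other token kinds that can appear in locale short forms are rejected.

```python
# Attribute letters and value codes transcribed from this section.
ON_OFF = {"X": "off", "O": "on", "D": "default"}
ATTRIBUTES = {
    "A": ("Alternate", {"N": "non-ignorable", "S": "shifted", "D": "default"}),
    "C": ("Case First", {"X": "off", "L": "lowercase first",
                         "U": "uppercase first", "D": "default"}),
    "E": ("Case Level", ON_OFF),
    "F": ("French Collation", ON_OFF),
    "H": ("Hiragana Quaternary", ON_OFF),
    "N": ("Normalization Checking", ON_OFF),
    "S": ("Strength", {"1": "primary", "2": "secondary", "3": "tertiary",
                       "4": "quaternary", "I": "identical", "D": "default"}),
}


def parse_short_form(short_form: str) -> dict:
    """Decode an underscore-separated short form into readable settings."""
    settings = {}
    for token in short_form.split("_"):
        if not token:
            raise ValueError("empty token in short form")
        code, value = token[0], token[1:]
        if code == "L":
            # Assumption: an L-prefixed token names the base short form.
            settings["Base short form"] = token
        elif code == "T":
            int(value, 16)  # raises ValueError if not hexadecimal
            settings["Variable Top"] = value
        elif code in ATTRIBUTES and value in ATTRIBUTES[code][1]:
            name, values = ATTRIBUTES[code]
            settings[name] = values[value]
        else:
            raise ValueError(f"unrecognized token: {token!r}")
    return settings


for name, setting in parse_short_form("AN_CX_EX_LROOT_NO_S3").items():
    print(f"{name}: {setting}")
```

Rejecting unknown tokens, rather than skipping them, makes a mistyped attribute letter or value visible immediately.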
