The Unicode Collation Algorithm is implemented by open source software provided by the International Components for Unicode (ICU). The software is a set of C/C++ and Java libraries.When Advanced Server is used to create a collation that invokes the ICU components to produce the collation, the result is referred to as an ICU collation.220.127.116.11 Locale CollationsWhen creating a collation for a locale, a predefined ICU short form name for the given locale is typically provided.An ICU short form is a method of specifying collation attributes, which are the properties of a collation. Section 18.104.22.168 provides additional information on collation attributes.There are predefined ICU short forms for locales. The ICU short form for a locale incorporates the collation attribute settings typically used for the given locale. This simplifies the collation creation process by eliminating the need to specify the entire list of collation attributes for that locale.The system catalog pg_catalog.pg_icu_collate_names contains a list of the names of the ICU short forms for locales. The ICU short form name is listed in column icu_short_form.If needed, the default characteristics of an ICU short form for a given locale can be overridden by specifying the collation attributes to override that property. This is discussed in the next section.22.214.171.124 Collation AttributesWhen creating an ICU collation, the desired characteristics of the collation must be specified. As discussed in Section 126.96.36.199, this can typically be done with an ICU short form for the desired locale. However, if more specific information is required, the specification of the collation properties can be done by using collation attributes.Collation attributes define the rules of how characters are to be compared for determining the collation sequence of text strings. As Unicode covers a vast set of languages in numerous variations according to country, territory and culture, these collation attributes are quite complex.For the complete, precise meaning and usage of collation attributes, see Section 13 “Collator Naming Scheme” on the ICU – International Components for Unicode website at:The following is a brief summary of the collation attributes and how they are specified using the ICU short form methodEach collation attribute is represented by an uppercase letter, which are listed in the following bullet points. The possible valid values for each attribute are given by codes shown within the parentheses. Some codes have general meanings for all attributes. X means to set the attribute off. O means to set the attribute on. D means to set the attribute to its default value.
• A – Alternate (N, S, D). Handles treatment of variable characters such as white spaces, punctuation marks, and symbols. When set to non-ignorable (N), differences in variable characters are treated with the same importance as differences in letters. When set to shifted (S), then differences in variable characters are of minor importance (that is, the variable character is ignored when comparing base characters).
• C – Case First (X, L, U, D). Controls whether a lowercase letter sorts before the same uppercase letter (L), or the uppercase letter sorts before the same lowercase letter (U). Off (X) is typically specified when lowercase first (L) is desired.
• E – Case Level (X, O, D). Set in combination with the Strength attribute, the Case Level attribute is used when accents are to be ignored, but not case.
• F – French Collation (X, O, D). When set to on, secondary differences (presence of accents) are sorted from the back of the string as done in the French Canadian locale.
• H – Hiragana Quaternary (X, O, D). Introduces an additional level to distinguish between the Hiragana and Katakana characters for compatibility with the JIS X 4061 collation of Japanese character strings.
• N – Normalization Checking (X, O, D). Controls whether or not text is thoroughly normalized for comparison. Normalization deals with the issue of canonical equivalence of text whereby different code point sequences represent the same character, which then present issues when sorting or comparing such characters. Languages such as Arabic, ancient Greek, Hebrew, Hindi, Thai, or Vietnamese should be used with Normalization Checking set to on.
• S – Strength (1, 2, 3, 4, I, D). Maximum collation level used for comparison. Influences whether accents or case are taken into account when collating or comparing strings. Each number represents a level. A setting of I represents identical strength (that is, level 5).
• T – Variable Top (hexadecimal digits). Applicable only when the Alternate attribute is not set to non-ignorable (N). The hexadecimal digits specify the highest character sequence that is to be considered ignorable. For example, if white space is to be ignorable, but visible variable characters are not to be ignorable, then Variable Top set to 0020 would be specified along with the Alternate attribute set to S and the Strength attribute set to 3. (The space character is hexadecimal 0020. Other non-visible variable characters such as backspace, tab, line feed, carriage return, etc. have values less than 0020. All visible punctuation marks have values greater than 0020.)A set of collation attributes and their values is represented by a text string consisting of the collation attribute letter concatenated with the desired attribute value. Each attribute/value pair is joined to the next pair with an underscore character as shown by the following example.Collation attributes can be specified along with a locale’s ICU short form name to override those default attribute settings of the locale.The following is an example where the ICU short form named LROOT is modified with a number of other collation attribute/value pairs.In the preceding example, the Alternate attribute (A) is set to non-ignorable (N). The Case First attribute (C) is set to off (X). The Case Level attribute (E) is set to off (X). The Normalization attribute (N) is set to on (O). The Strength attribute (S) is set to the tertiary level 3. LROOT is the ICU short form to which these other attributes are applying modifications.