The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation. The sort order of text in an SQL database is influenced by these so-called collations. These usually work behind the scenes, but they offer a number of interesting customization possibilities, while also being a potential source of operational problems and user confusion.
In order to get a deeper understanding of how modern collations affect the management of databases, along with its features, problems, and future development, 2ndQuadrant hosted a live webinar, Collations: Introduction, Features, Problems.
This webinar was presented by Peter Eisentraut (Core Team Member of the PostgreSQL Project and CTO at 2ndQuadrant), in which he covered the following topics:
- How modern collations work?
- The role of Unicode.
- Which variants and customization options are available, such as language-specific collation and case insensitivity, and how PostgreSQL implements these options.
- Relevant development projects for future PostgreSQL versions.
- How collations affect the management of databases, especially upgrades.
Those who weren’t able to attend the live webinar can now view the recording here.
Due to limited time, some of the questions were not answered during the live webinar, so answers by the host are provided below:
Question: What does the behavior on windows for the collation provider?
Answer: On Windows, you also have the option of the libc and the ICU collation providers. The libc provider, in this case, is the Windows operating system libraries; ICU is the same external library that you have available on other operating systems.
Question: Can you mention what you can run to check if your indexes are corrupted due to the glibc update?
Answer: To check indexes for corruption, either due to collation changes or any other effect, use the amcheck contrib module.
See https://www.postgresql.org/docs/current/amcheck.html.
Question: In terms of efficiency, how does ICU case insensitive, accent insensitive compares with citex and unnaccent?
Answer: In general, ICU is extremely well optimized. Details depend on the data set and the operations (sort or lookup, lots of characters needing special treatment or few). I always recommend checking yourself on your data. However, ICU is clearly going to be the preferred option in the long run, so I’d recommend checking that first.
Question: In terms of efficiency, how does ICU case insensitive, accent insensitive compares with citex and unnaccent?
Answer: In general, ICU is extremely well optimized. Details depend on the data set and the operations (sort or lookup, lots of characters needing special treatment or few). I always recommend checking yourself on your data. However, ICU is clearly going to be the preferred option in the long run, so I’d recommend checking that first.
To be the first to know about upcoming PostgreSQL webinars by 2ndQuadrant, visit our Webinars page.
For any questions, comments, or feedback, please visit our website or send an email to webinar@2ndquadrant.com.