The character set support in EnterpriseDB
allows you to store text in a variety of character sets, including
single-byte character sets such as the ISO 8859 series and
multiple-byte character sets such as EUC (Extended Unix
Code), UTF-8, and Mule internal code. All supported character sets
can be used transparently by clients, but a few are not supported
for use within the server (that is, as a server-side encoding).
The default character set is selected while
initializing your EnterpriseDB database
cluster using initdb. It can be overridden when you
create a database, so you can have multiple
databases each with a different character set.
Table 34-1 shows the character sets available
for use in EnterpriseDB.
Table 34-1. EnterpriseDB Character Sets

| Name | Description | Language | Server? | Bytes/Char | Aliases |
|---|---|---|---|---|---|
| BIG5 | Big Five | Traditional Chinese | No | 1-2 | WIN950, Windows950 |
| EUC_CN | Extended UNIX Code-CN | Simplified Chinese | Yes | 1-3 | |
| EUC_JP | Extended UNIX Code-JP | Japanese | Yes | 1-3 | |
| EUC_KR | Extended UNIX Code-KR | Korean | Yes | 1-3 | |
| EUC_TW | Extended UNIX Code-TW | Traditional Chinese, Taiwanese | Yes | 1-3 | |
| GB18030 | National Standard | Chinese | No | 1-2 | |
| GBK | Extended National Standard | Simplified Chinese | No | 1-2 | WIN936, Windows936 |
| ISO_8859_5 | ISO 8859-5, ECMA 113 | Latin/Cyrillic | Yes | 1 | |
| ISO_8859_6 | ISO 8859-6, ECMA 114 | Latin/Arabic | Yes | 1 | |
| ISO_8859_7 | ISO 8859-7, ECMA 118 | Latin/Greek | Yes | 1 | |
| ISO_8859_8 | ISO 8859-8, ECMA 121 | Latin/Hebrew | Yes | 1 | |
| JOHAB | JOHAB | Korean (Hangul) | Yes | 1-3 | |
| KOI8 | KOI8-R(U) | Cyrillic | Yes | 1 | KOI8R |
| LATIN1 | ISO 8859-1, ECMA 94 | Western European | Yes | 1 | ISO88591 |
| LATIN2 | ISO 8859-2, ECMA 94 | Central European | Yes | 1 | ISO88592 |
| LATIN3 | ISO 8859-3, ECMA 94 | South European | Yes | 1 | ISO88593 |
| LATIN4 | ISO 8859-4, ECMA 94 | North European | Yes | 1 | ISO88594 |
| LATIN5 | ISO 8859-9, ECMA 128 | Turkish | Yes | 1 | ISO88599 |
| LATIN6 | ISO 8859-10, ECMA 144 | Nordic | Yes | 1 | ISO885910 |
| LATIN7 | ISO 8859-13 | Baltic | Yes | 1 | ISO885913 |
| LATIN8 | ISO 8859-14 | Celtic | Yes | 1 | ISO885914 |
| LATIN9 | ISO 8859-15 | LATIN1 with Euro and accents | Yes | 1 | ISO885915 |
| LATIN10 | ISO 8859-16, ASRO SR 14111 | Romanian | Yes | 1 | ISO885916 |
| MULE_INTERNAL | Mule internal code | Multilingual Emacs | Yes | 1-4 | |
| SJIS | Shift JIS | Japanese | No | 1-2 | Mskanji, ShiftJIS, WIN932, Windows932 |
| SQL_ASCII | unspecified (see text) | any | Yes | 1 | |
| UHC | Unified Hangul Code | Korean | No | 1-2 | WIN949, Windows949 |
| UTF8 | Unicode, 8-bit | all | Yes | 1-4 | Unicode |
| WIN866 | Windows CP866 | Cyrillic | Yes | 1 | ALT |
| WIN874 | Windows CP874 | Thai | Yes | 1 | |
| WIN1250 | Windows CP1250 | Central European | Yes | 1 | |
| WIN1251 | Windows CP1251 | Cyrillic | Yes | 1 | WIN |
| WIN1252 | Windows CP1252 | Western European | Yes | 1 | |
| WIN1253 | Windows CP1253 | Greek | Yes | 1 | |
| WIN1254 | Windows CP1254 | Turkish | Yes | 1 | |
| WIN1255 | Windows CP1255 | Hebrew | Yes | 1 | |
| WIN1256 | Windows CP1256 | Arabic | Yes | 1 | |
| WIN1257 | Windows CP1257 | Baltic | Yes | 1 | |
| WIN1258 | Windows CP1258 | Vietnamese | Yes | 1 | ABC, TCVN, TCVN5712, VSCII |

Not all APIs support all the listed character sets. For example, the
EnterpriseDB
JDBC driver does not support MULE_INTERNAL, LATIN6,
LATIN8, and LATIN10.
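
The Bytes/Char column of Table 34-1 can be made concrete with a small experiment. The following is only an illustrative sketch in Python, which is not part of EnterpriseDB: it assumes that Python's standard codecs are close enough stand-ins for the server encodings of the same family, and the `CODECS` mapping and `encoded_length` helper are hypothetical names introduced here.

```python
# Assumption: these Python codec names approximate the server encodings
# listed in Table 34-1; the mapping is illustrative, not authoritative.
CODECS = {
    "LATIN1": "latin-1",
    "EUC_JP": "euc_jp",
    "EUC_KR": "euc_kr",
    "UTF8": "utf-8",
}


def encoded_length(char, encoding_name):
    """Return the number of bytes one character occupies in an encoding."""
    return len(char.encode(CODECS[encoding_name]))


if __name__ == "__main__":
    samples = [
        ("LATIN1", "\u00e9"),      # e with acute accent
        ("EUC_JP", "\u3042"),      # hiragana "a"
        ("EUC_KR", "\ud55c"),      # hangul syllable "han"
        ("UTF8", "\u3042"),        # the same hiragana in UTF-8
        ("UTF8", "\U0001F600"),    # a character outside the basic plane
    ]
    for name, sample in samples:
        print(name, repr(sample), encoded_length(sample, name))
```

The same character can need a different number of bytes depending on the encoding, which is why the column gives a range for the multiple-byte character sets.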
initdb defines the default character set
for an EnterpriseDB cluster. For example,
initdb -E EUC_JP
sets the default character set (encoding) to
EUC_JP (Extended Unix Code for Japanese). You
can use --encoding instead of
-E if you prefer to type longer option strings.
If no -E or --encoding option is
given, SQL_ASCII is used.
You can create a database with a different character set:
createdb -E EUC_KR korean
This will create a database named korean that
uses the character set EUC_KR. Another way to
accomplish this is to use this SQL command:
CREATE DATABASE korean WITH ENCODING 'EUC_KR';
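
Because some character sets in Table 34-1 are marked "No" in the Server? column, a script that creates databases may want to check the requested encoding first. The sketch below is a hypothetical helper in Python, not an EnterpriseDB API. It copies the client-only names and a handful of aliases from Table 34-1, and it assumes that names can be compared case-insensitively. It does not check that the name is a known encoding, and it does not quote the database name.

```python
# Encodings marked "No" in the Server? column of Table 34-1.
CLIENT_ONLY = {"BIG5", "GB18030", "GBK", "SJIS", "UHC"}

# A partial alias list taken from the Aliases column of Table 34-1.
ALIASES = {
    "WIN950": "BIG5",
    "WINDOWS950": "BIG5",
    "WIN936": "GBK",
    "MSKANJI": "SJIS",
    "SHIFTJIS": "SJIS",
    "WIN932": "SJIS",
    "WIN949": "UHC",
    "UNICODE": "UTF8",
    "ISO88591": "LATIN1",
    "KOI8R": "KOI8",
}


def canonical_name(name):
    """Map an alias to its primary name (assumes case-insensitive names)."""
    key = name.upper()
    return ALIASES.get(key, key)


def usable_as_server_encoding(name):
    """Return False for encodings that Table 34-1 lists as client-only."""
    return canonical_name(name) not in CLIENT_ONLY


def create_database_sql(dbname, encoding):
    """Build the CREATE DATABASE statement shown in the text."""
    if not usable_as_server_encoding(encoding):
        raise ValueError(f"{encoding} is not supported as a server encoding")
    return f"CREATE DATABASE {dbname} WITH ENCODING '{canonical_name(encoding)}';"


if __name__ == "__main__":
    print(create_database_sql("korean", "EUC_KR"))
```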
The encoding for a database is stored in the system catalog
pg_database. You can see it by using the
-l option or the \l command
of gsql.
$ gsql -l
            List of databases
   Database    |  Owner  |   Encoding
---------------+---------+---------------
 euc_cn        | t-ishii | EUC_CN
 euc_jp        | t-ishii | EUC_JP
 euc_kr        | t-ishii | EUC_KR
 euc_tw        | t-ishii | EUC_TW
 mule_internal | t-ishii | MULE_INTERNAL
 regression    | t-ishii | SQL_ASCII
 template1     | t-ishii | EUC_JP
 test          | t-ishii | EUC_JP
 unicode       | t-ishii | UNICODE
(9 rows)
EnterpriseDB supports automatic
character set conversion between server and client for certain
character set combinations. The conversion information is stored in the
pg_conversion system catalog. EnterpriseDB
comes with some predefined conversions, as shown in Table 34-2. You can create a new
conversion using the SQL command CREATE CONVERSION.
Table 34-2. Client/Server Character Set Conversions

| Server Character Set | Available Client Character Sets |
|---|---|
| BIG5 | not supported as a server encoding |
| EUC_CN | EUC_CN, MULE_INTERNAL, UTF8 |
| EUC_JP | EUC_JP, MULE_INTERNAL, SJIS, UTF8 |
| EUC_KR | EUC_KR, MULE_INTERNAL, UTF8 |
| EUC_TW | EUC_TW, BIG5, MULE_INTERNAL, UTF8 |
| GB18030 | not supported as a server encoding |
| GBK | not supported as a server encoding |
| ISO_8859_5 | ISO_8859_5, KOI8, MULE_INTERNAL, UTF8, WIN866, WIN1251 |
| ISO_8859_6 | ISO_8859_6, UTF8 |
| ISO_8859_7 | ISO_8859_7, UTF8 |
| ISO_8859_8 | ISO_8859_8, UTF8 |
| JOHAB | JOHAB, UTF8 |
| KOI8 | KOI8, ISO_8859_5, MULE_INTERNAL, UTF8, WIN866, WIN1251 |
| LATIN1 | LATIN1, MULE_INTERNAL, UTF8 |
| LATIN2 | LATIN2, MULE_INTERNAL, UTF8, WIN1250 |
| LATIN3 | LATIN3, MULE_INTERNAL, UTF8 |
| LATIN4 | LATIN4, MULE_INTERNAL, UTF8 |
| LATIN5 | LATIN5, UTF8 |
| LATIN6 | LATIN6, UTF8 |
| LATIN7 | LATIN7, UTF8 |
| LATIN8 | LATIN8, UTF8 |
| LATIN9 | LATIN9, UTF8 |
| LATIN10 | LATIN10, UTF8 |
| MULE_INTERNAL | MULE_INTERNAL, BIG5, EUC_CN, EUC_JP, EUC_KR, EUC_TW, ISO_8859_5, KOI8, LATIN1 to LATIN4, SJIS, WIN866, WIN1250, WIN1251 |
| SJIS | not supported as a server encoding |
| SQL_ASCII | any (no conversion will be performed) |
| UHC | not supported as a server encoding |
| UTF8 | all supported encodings |
| WIN866 | WIN866, ISO_8859_5, KOI8, MULE_INTERNAL, UTF8, WIN1251 |
| WIN874 | WIN874, UTF8 |
| WIN1250 | WIN1250, LATIN2, MULE_INTERNAL, UTF8 |
| WIN1251 | WIN1251, ISO_8859_5, KOI8, MULE_INTERNAL, UTF8, WIN866 |
| WIN1252 | WIN1252, UTF8 |
| WIN1253 | WIN1253, UTF8 |
| WIN1254 | WIN1254, UTF8 |
| WIN1255 | WIN1255, UTF8 |
| WIN1256 | WIN1256, UTF8 |
| WIN1257 | WIN1257, UTF8 |
| WIN1258 | WIN1258, UTF8 |

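
Table 34-2 is essentially a lookup from a server character set to the client character sets it can be converted to. A minimal sketch of that lookup in Python follows. It reproduces only a few rows of the table, the function name `conversion_available` is hypothetical, and the authoritative list for a running server is the pg_conversion catalog, not this excerpt.

```python
# Encodings that Table 34-2 lists as not supported on the server side.
NOT_SERVER = {"BIG5", "GB18030", "GBK", "SJIS", "UHC"}

# Excerpt of Table 34-2 (a few rows only, for illustration).
CONVERSIONS = {
    "EUC_JP": {"EUC_JP", "MULE_INTERNAL", "SJIS", "UTF8"},
    "EUC_KR": {"EUC_KR", "MULE_INTERNAL", "UTF8"},
    "LATIN1": {"LATIN1", "MULE_INTERNAL", "UTF8"},
    "LATIN2": {"LATIN2", "MULE_INTERNAL", "UTF8", "WIN1250"},
    "KOI8": {"KOI8", "ISO_8859_5", "MULE_INTERNAL", "UTF8", "WIN866", "WIN1251"},
}


def conversion_available(server, client):
    """Return True if the excerpt lists client as usable with server.

    Raises ValueError for client-only encodings and KeyError for server
    encodings that are not part of this excerpt.
    """
    if server in NOT_SERVER:
        raise ValueError(f"{server} is not supported as a server encoding")
    if server == "SQL_ASCII":
        return True  # any client encoding; no conversion is performed
    if server == "UTF8":
        return True  # all supported encodings (client name is not validated here)
    return client in CONVERSIONS[server]


if __name__ == "__main__":
    print(conversion_available("EUC_JP", "SJIS"))
    print(conversion_available("EUC_KR", "SJIS"))
```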
To enable the automatic character set conversion, you have to
tell EnterpriseDB the character set
(encoding) you would like to use in the client. There are several
ways to accomplish this:
- Using the \encoding command in gsql.
\encoding allows you to change the client
encoding on the fly. For
example, to change the encoding to SJIS, type:
\encoding SJIS
- Using libpq functions.
\encoding actually calls
PQsetClientEncoding() for its purpose.
int PQsetClientEncoding(PGconn *conn, const char *encoding);
where conn is a connection to the server,
and encoding is the encoding you
want to use. If the function successfully sets the encoding, it returns 0,
otherwise -1. The current encoding for this connection can be determined by
using:
int PQclientEncoding(const PGconn *conn);
Note that it returns the encoding ID, not a symbolic string
such as EUC_JP. To convert an encoding ID to an encoding name, you
can use:
char *pg_encoding_to_char(int encoding_id);
- Using SET client_encoding TO.
Setting the client encoding can be done with this SQL command:
SET CLIENT_ENCODING TO 'value';
You can also use the standard SQL syntax SET NAMES for this purpose:
SET NAMES 'value';
To query the current client encoding:
SHOW client_encoding;
To return to the default encoding:
RESET client_encoding;
- Using PGCLIENTENCODING. If the environment variable
PGCLIENTENCODING is defined in the client's
environment, that client encoding is automatically selected
when a connection to the server is made. (This can
subsequently be overridden using any of the other methods
mentioned above.)
- Using the configuration variable client_encoding. If the
client_encoding variable is set, that client
encoding is automatically selected when a connection to the
server is made. (This can subsequently be overridden using any
of the other methods mentioned above.)
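
Two of the methods above are easy to drive from a script: setting PGCLIENTENCODING before a client program starts, and sending a SET CLIENT_ENCODING statement. The following Python sketch shows both under stated assumptions. The helper names are hypothetical, the name check is a deliberately strict simplification (letters, digits, and underscores only), and nothing here contacts a server.

```python
import os


def env_with_client_encoding(encoding, base_env=None):
    """Return a copy of an environment with PGCLIENTENCODING set.

    The copy can be handed to a child process so that the client
    selects this encoding when it connects.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PGCLIENTENCODING"] = encoding
    return env


def set_client_encoding_sql(encoding):
    """Build the SET CLIENT_ENCODING statement for a simple encoding name."""
    # Simplifying assumption: accept only names made of letters, digits,
    # and underscores, which covers every name in Table 34-1.
    if not encoding.replace("_", "").isalnum():
        raise ValueError(f"unexpected encoding name: {encoding!r}")
    return f"SET CLIENT_ENCODING TO '{encoding}';"


if __name__ == "__main__":
    print(env_with_client_encoding("SJIS", {"PATH": "/usr/bin"}))
    print(set_client_encoding_sql("EUC_JP"))
```

A typical use, assuming gsql is on the search path, would be to pass the returned dictionary as the `env` argument of `subprocess.run(["gsql", "-l"], ...)`.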
If the conversion of a particular character is not possible,
an error is reported. For example, if you chose EUC_JP for the
server and LATIN1 for the client, some
Japanese characters have no representation in
LATIN1, so converting them fails.
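
The failure case can be reproduced outside the database. The sketch below uses Python's own codecs as stand-ins for the server's conversion routines, which is an assumption made purely for illustration; the server reports its own error, not a Python exception. The `convert` helper is a hypothetical name.

```python
def convert(data, source_codec, target_codec):
    """Re-encode bytes from one encoding to another.

    Raises UnicodeEncodeError when a character has no representation
    in the target encoding.
    """
    text = data.decode(source_codec)
    return text.encode(target_codec)


if __name__ == "__main__":
    # Text as it might be stored in an EUC_JP database.
    stored = "\u65e5\u672c\u8a9e".encode("euc_jp")

    # EUC_JP to SJIS works: both encodings cover these characters.
    print(convert(stored, "euc_jp", "shift_jis"))

    # EUC_JP to LATIN1 fails: LATIN1 has no Japanese characters.
    try:
        convert(stored, "euc_jp", "latin-1")
    except UnicodeEncodeError as exc:
        print("conversion failed:", exc.reason)
```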
These are good starting points for learning about the various kinds of
encoding systems.
- ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
Detailed explanations of EUC_JP,
EUC_CN, EUC_KR,
EUC_TW appear in section 3.2.
- http://www.unicode.org
The web site of the Unicode Consortium
- RFC 2044
The original definition of UTF-8. It has since been superseded by
RFC 2279 and then RFC 3629.