Firebird Documentation Index → Firebird 2.1 Release Notes → International Language Support (INTL) |
Table of Contents
This chapter describes the new international language support interface that was introduced with Firebird 2. Since then, a number of additions and improvements have been added, including the ability to implement UNICODE collations from external libraries generically. New DDL syntax has been introduced to assist with this task, in the form of the CREATE COLLATION statement.
Originally described by N. Samofatov, Firebird 2's new interface for international character sets features many enhancements that have been implemented by me.
Firebird allows character sets and collations to be declared in any character field or variable declaration. The default character set can also be specified at database create time, to cause every CHAR/VARCHAR declaration that does not specifically include a CHARACTER SET clause to use this default.
At attachment time you normally specify the character set that the client is to use to read strings. If no "client" (or "connection") character set is specified, character set NONE is assumed.
Two special character sets, NONE and OCTETS, can be used in declarations. However, OCTETS cannot be used as a connection character set. The two sets are similar, except that the space character of NONE is ASCII 0x20, whereas the space character OCTETS is 0x00. NONE and OCTETS are “special” in the sense that they follow different rules from those applicable to other character sets regarding conversions.
With other character sets, conversion is performed as CHARSET1->UNICODE->CHARSET2.
With NONE/OCTETS the bytes are just copied: NONE/OCTETS->CHARSET2 and CHARSET1->NONE/OCTETS.
Enhancements that the new system brings include:
Some character sets (especially multi-byte) do not accept just any string. Now, the engine verifies that strings are well-formed when assigning from NONE/OCTETS and when strings sent by the client (the statement string and parameters).
In Firebird 1.5.x, only the ASCII-equivalent characters are uppercased in any character set's default (binary) collation order, which is the one that is used if no collation is specified.
For example,
isql -q -ch dos850 SQL> create database 'test.fdb'; SQL> create table t (c char(1) character set dos850); SQL> insert into t values ('a'); SQL> insert into t values ('e'); SQL> insert into t values ('á'); SQL> insert into t values ('é'); SQL> SQL> select c, upper(c) from t; C UPPER ====== ====== a A e E á á é é
In Firebird 2 the result is:
C UPPER ====== ====== a A e E á Á é É
In v.1.5.x the engine does not verify the logical length of multi-byte character set (MBCS) strings. Hence, a UNICODE_FSS field takes three times as many characters as the declared field size, three being the maximum length of one UNICODE_FSS character.
This has been retained for compatibility for legacy character sets. However, new character sets (UTF8, for example) do not inherit this limitation.
When the character set of a CHAR or VARCHAR column is anything but NONE or OCTETS and the attachment character set is not NONE, the sqlsubtype member of an XSQLVAR pertaining to that column now contains the attachment (connection) character set number instead of the column's character set.
Several character set-related enhancements have been added for text BLOBs.
A DML COLLATE clause is now allowed with BLOBs.
Example
select blob_column from table where blob_column collate unicode = 'foo';
Comparison can be performed on the entire content of a text BLOB.
Character sets and collations are installed using a manifest file.
The manifest file should be put in the $rootdir/intl with a .conf extension. It is used to locate character sets and collations in the libraries. If a character set/collation is declared more than once, it is not loaded and the error is reported in the log.
The file /intl/fbintl.conf
is an example of a manifest file. The following snippet
is an excerpt from /intl/fbintl.conf
:
<intl_module fbintl> filename $(this)/fbintl </intl_module> <charset ISO8859_1> intl_module fbintl collation ISO8859_1 collation DA_DA collation DE_DE collation EN_UK collation EN_US collation ES_ES collation PT_BR collation PT_PT </charset> <charset WIN1250> intl_module fbintl collation WIN1250 collation PXW_CSY collation PXW_HUN collation PXW_HUNDC </charset>
The symbol $(this) is used to indicate the same directory as the manifest file and the library extension should be omitted.
Two character sets introduced in Firebird 2 will be of particular interest if you have struggled with the shortcomings of UNICODE_FSS in past versions.
The UNICODE_FSS character set has a number of problems: it's an old version of UTF8 that accepts malformed strings and does not enforce correct maximum string length. In FB 1.5.X UTF8 is an alias to UNICODE_FSS.
Now, UTF8 is a new character set, without the inherent problems of UNICODE_FSS.
UCS_BASIC works identically to UTF8 with no collation specified (sorts in UNICODE code-point order). The UNICODE collation sorts using UCA (Unicode Collation Algorithm).
Sort order sample:
isql -q -ch dos850 SQL> create database 'test.fdb'; SQL> create table t (c char(1) character set utf8); SQL> insert into t values ('a'); SQL> insert into t values ('A'); SQL> insert into t values ('á'); SQL> insert into t values ('b'); SQL> insert into t values ('B'); SQL> select * from t order by c collate ucs_basic; C ====== A B a b á SQL> select * from t order by c collate unicode; C ====== a A á b B
The 2.1 release sees further capabilities implemented for
using ICU charsets through fbintl
UNICODE collation (charset_UNICODE) being available for all fbintl charsets
using collation attributes
CREATE/DROP COLLATION statements
SHOW COLLATION and collation extraction in ISQL
Verifying that text blobs are well-formed
Transliterating text blobs automatically
All non-wide and ASCII-based character sets present in ICU can be used by Firebird 2.1. To reduce the size of the distribution kit, we customize ICU to include only essential character sets and any for which there was a specific feature request.
If the character set you need is not included, you can replace the ICU libraries with another complete module, found at our site or already installed in your operating system.
To use an alternative character set module, you need to register it in two places:
in the server's language configuration file, intl/fbintl.conf
in each database that is going to use it
Using a text editor, register the module in intl/fbintl.conf
, as follows.-
<charset NAME> intl_module fbintl collation NAME [REAL-NAME] </charset>
To register the module in a database, run the procedure sp_register_character_set
, the source for which can be found in misc/intl.sql
beneath your Firebird 2.1 root
A Sample
Here is the sample declaration in fbintl.conf
:
<charset GB> intl_module fbintl collation GB GB18030 </charset>
The stored procedure takes two arguments: a string that is the character set's identifier as declared in the configuration file and a smallint that is the maximum number of bytes a single character can occupy in the encoding. For our example:
execute procedure sp_register_character_set ('GB', 4);
Syntax for CREATE COLLATION
CREATE COLLATION <name> FOR <charset> [ FROM <base> | FROM EXTERNAL ('<name>') ] [ NO PAD | PAD SPACE ] [ CASE SENSITIVE | CASE INSENSITIVE ] [ ACCENT SENSITIVE | ACCENT INSENSITIVE ] [ '<specific-attributes>' ]
Specific attributes should be separated by semicolon and are case sensitive.
Examples
/* 1 */ CREATE COLLATION UNICODE_ENUS_CI FOR UTF8 FROM UNICODE CASE INSENSITIVE 'LOCALE=en_US'; /* 2 */ CREATE COLLATION NEW_COLLATION FOR WIN1252 PAD SPACE; /* NEW_COLLATION should be declared in .conf file in the $root/intl directory */
The UNICODE collations (case sensitive and case insensitive) can be applied to any character
set that is present in fbintl. They are already registered in fbintl.conf
,
but you need to register them in the databases, with the desired associations and
attributes.
Naming Conventions
The naming convention you should use is charset_collation
. For
example,
create collation win1252_unicode for win1252; create collation win1252_unicode_ci for win1252 from win1252_unicode case insensitive;
The character set name should be as in fbintl.conf (i.e. ISO8859_1 instead of ISO88591, for example).
Some attributes may not work with some collations, even though they do not report an error.
Disable compressions (aka contractions) changing the order of a group of characters.
Valid for collations of narrow character sets.
Format: DISABLE-COMPRESSIONS={0 | 1}
Example
DISABLE-COMPRESSIONS=1
Disable expansions changing the order of a character to sort as a group of characters.
Valid for collations of narrow character sets.
Format: DISABLE-EXPANSIONS={0 | 1}
Example
DISABLE-EXPANSIONS=1
Specify what version of ICU library will be used. Valid values are the ones defined in the config file (intl/fbintl.conf) in entry intl_module/icu_versions.
Valid for UNICODE and UNICODE_CI.
Format: ICU-VERSION={default | major.minor}
Example
ICU-VERSION=3.0
Specify the collation locale.
Valid for UNICODE and UNICODE_CI. Requires complete version of ICU libraries.
Format: LOCALE=xx_XX
Example
LOCALE=en_US
Uses more than one level for ordering purposes.
Valid for collations of narrow character sets.
Format: MULTI-LEVEL={0 | 1}
Example
MULTI-LEVEL=1
Order special characters (spaces, symbols, etc) before alphanumeric characters.
Valid for collations of narrow character sets.
Format: SPECIALS-FIRST={0 | 1}
Example
SPECIALS-FIRST=1
ES_ES (as well as the new ES_ES_CI_AI) collation automatically uses attributes DISABLE-COMPRESSIONS=1;SPECIALS-FIRST=1.
The attributes are stored at database creation time, so the changes do not apply to databases with ODS < 11.1.
The ES_ES_CI_AI collation was standardised to current usage.
Case-insensitive collation for UTF-8. See feature request CORE-972
Firebird Documentation Index → Firebird 2.1 Release Notes → International Language Support (INTL) |