3.5Character Data Types

For working with character data, Firebird has the fixed-length CHAR and the variable-length VARCHAR data types. The maximum size of text data stored in these data types is 32,767 bytes for CHAR and 32,765 bytes for VARCHAR. The maximum number of characters that will fit within these limits depends on the character set used for the data under consideration. The collation does not affect this maximum, although it may affect the maximum size of any index that involves the column.

If no character set is explicitly specified when defining a character object, the default character set of the database — at time of defining the object — will be used. If the database does not have a default character set defined, the object gets the character set NONE.

3.5.1Unicode

Most current development tools support Unicode, implemented in Firebird with the character sets UTF8 and UNICODE_FSS. UTF8 comes with collations for many languages. UNICODE_FSS is more limited and was previously used mainly by Firebird internally for storing metadata. Keep in mind that one UTF8 character occupies up to 4 bytes, thus limiting the size of CHAR fields to 8,191 characters (32,767/4).

The actual bytes per character value depends on the range the character belongs to. Non-accented Latin letters occupy 1 byte, Cyrillic letters from the WIN1251 encoding occupy 2 bytes in UTF8, characters from other encodings may occupy up to 4 bytes.

The UTF8 character set implemented in Firebird supports the latest version of the Unicode standard, thus recommending its use for international databases.

3.5.2Client Character Set

While working with strings, it is essential to keep the character set of the client connection in mind. If there is a mismatch between the character sets of the stored data and that of the client connection, the output results for string columns are automatically re-encoded, both when data are sent from the client to the server and when they are sent back from the server to the client. For example, if the database was created in the WIN1251 encoding but KOI8R or UTF8 is specified in the client’s connection parameters, the mismatch will be transparent.

3.5.3Special Character Sets

Character set NONEThe character set NONE is a special character set in Firebird. It can be characterized such that each byte is a part of a string, but the string is stored in the system without any clues about what constitutes any character: character encoding, collation, case, etc. are simply unknown. It is the responsibility of the client application to deal with the data and provide the means to interpret the string of bytes in some way that is meaningful to the application and the human user.

Character set OCTETSData in OCTETS encoding are treated as bytes that may not be interpreted as characters. OCTETS provides a way to store binary data. The database engine has no concept of what it is meant to do with a string of bytes in OCTETS, other than store and retrieve it. Again, the client side is responsible for validating the data, presenting them in formats that are meaningful to the application and its users and handling any exceptions arising from decoding and encoding them. CHAR and VARCHAR with character set OCTETS have synonyms BINARY and VARBINARY.

3.5.4Collation

Each character set has a default collation (COLLATE) that specifies the collation order (or, collation sequence, or collating sequence). Usually, it provides nothing more than ordering based on the numeric code of the characters and a basic mapping of upper- and lower-case characters. If some behaviour is needed for strings that is not provided by the default collation and a suitable alternative collation is supported for that character set, a COLLATE collation clause can be specified in the column definition.

A COLLATE collation clause can be applied in other contexts besides the column definition. For comparison operations, it can be added in the WHERE clause of a SELECT statement. If output needs to be sorted in a special alphabetic sequence, or case-insensitively, and the appropriate collation exists, then a COLLATE clause can be included with the ORDER BY clause when rows are being sorted on a character field and with the GROUP BY clause in case of grouping operations.

3.5.4.1Case-Insensitive Searching

For a case-insensitive search, the UPPER function could be used to convert both the search argument and the searched strings to upper-case before attempting a match:

  |...
  |where upper(name) = upper(:flt_name)

For strings in a character set that has a case-insensitive collation available, you can apply the collation, to compare the search argument and the searched strings directly. For example, using the WIN1251 character set, the collation PXW_CYRL is case-insensitive for this purpose:

  |...
  |WHERE FIRST_NAME COLLATE PXW_CYRL >= :FLT_NAME
  |...
  |ORDER BY NAME COLLATE PXW_CYRL

See alsoCONTAINING

3.5.4.2UTF8 Collations

The following table shows the possible collations for the UTF8 character set.

Table 3.9Collations for Character Set UTF8
CollationCharacteristics

UCS_BASIC

Collation works according to the position of the character in the table (binary).

UNICODE

Collation works according to the UCA algorithm (Unicode Collation Algorithm) (alphabetical).

UTF8

The default, binary collation, identical to UCS_BASIC, which was added for SQL compatibility

UNICODE_CI

Case-insensitive collation, works without taking character case into account.

UNICODE_CI_AI

Case-insensitive, accent-insensitive collation, works alphabetically without taking character case or accents into account.

ExampleAn example of collation for the UTF8 character set without taking into account the case or accentuation of characters (similar to COLLATE PXW_CYRL in the earlier example).

  |...
  |ORDER BY NAME COLLATE UNICODE_CI_AI

3.5.5Character Indexes

The maximum length for an index key equals one quarter of the page size, i.e. from 1,024 — for page size 4,096 — to 8,192 bytes — for page size 32,768. The maximum length of an indexed string is 9 bytes less than that quarter-page limit.

The table below shows the maximum length of an indexed string (in characters), according to page size and character set, calculated using this formula.

Table 3.10Maximum Index Lengths by Page Size and Character Size

Page Size

Bytes per character

1

2

3

4

6

4,096

1,015

507

338

253

169

8,192

2,039

1,019

679

509

339

16,384

4,087

2,043

1,362

1,021

681

32,768

8,183

4,091

2,727

2,045

1,363

With case-insensitive collations (_CI), one character in the index key will occupy not 4, but 6 (six) bytes, so the maximum key length for a page of — for example — 4,096 bytes, will be 169 characters.

See alsoCREATE DATABASE, Collation, SELECT, WHERE, GROUP BY, ORDER BY

3.5.6Character Types in Detail

3.5.6.1BINARY

Data Type Declaration Format

  |BINARY [(length)]

Table 3.11BINARY Type Parameters
ParameterDescription

length

Length in bytes between 1 and 32,767; defaults to 1.

BINARY is a fixed-length binary data type, and is an SQL standard-compliant alias for CHAR(length) CHARACTER SET OCTETS. Values shorter than the declared length are padded with NUL (0x00) up to the declared length.

Note

Some tools may report the type as CHAR CHARACTER SET OCTETS instead of BINARY.

See alsoSection 3.5.6.2, “CHAR, Section 3.5.6.3, “VARBINARY

3.5.6.2CHAR

Data Type Declaration Format

  |{CHAR | CHARACTER} [(length)]
  |  [CHARACTER SET <set>] [COLLATE <name>]

Table 3.12CHAR Type Parameters
ParameterDescription

length

Length in characters, defaults to 1. A valid length is from 1 to the maximum number of characters that can be accommodated within 32,767 bytes.

set

Character set name

name

Collation name

CHAR is a fixed-length character data type. Values shorter than the declared length are padded with spaces up to the declared length. The pad character does not have to be a space (0x20): it depends on the character set. For example, the pad character for the OCTETS character set is NUL (0x00).

Fixed-length character data can be used to store codes whose length is standard and has a definite width. An example of such a code is an EAN13 barcode — 13 characters, all filled.

CHAR CHARACTER SET OCTETS has the alias Section 3.5.6.1, “BINARY.

Formally, the COLLATE clause is not part of the data type declaration, and its position depends on the syntax of the statement.

See alsoSection 3.5.6.1, “BINARY, Section 3.5.6.4, “VARCHAR

3.5.6.3VARBINARY

Data Type Declaration Format

  |{VARBINARY | BINARY VARYING} (length)

Table 3.13VARBINARY Type Parameters
ParameterDescription

length

Length in bytes between 1 and 32,765

VARBINARY is a variable-length binary type, and is an SQL standard-compliant alias for VARCHAR(length) CHARACTER SET OCTETS.

Note

Some tools may report the type as VARCHAR CHARACTER SET OCTETS instead of VARBINARY.

See alsoSection 3.5.6.4, “VARCHAR, Section 3.5.6.1, “BINARY

3.5.6.4VARCHAR

Data Type Declaration Format

  |{VARCHAR | {CHAR | CHARACTER} VARYING} (length)
  |  [CHARACTER SET <set>] [COLLATE <name>]

Table 3.14VARCHAR Type Parameters
ParameterDescription

length

Length in characters. A valid length is from 1 to the maximum number of characters that can be accommodated within 32,765 bytes.

set

Character set name

name

Collation name

VARCHAR is a variable-length character data type, up to a maximum of 32,765 bytes. The stored structure is equal to the actual size of the data plus 2 bytes to record the length of the data.

All characters that are sent from the client application to the database are considered meaningful, including leading and trailing spaces.

VARCHAR CHARACTER SET OCTETS has the alias Section 3.5.6.3, “VARBINARY.

Formally, the COLLATE clause is not part of the data type declaration, and its position depends on the syntax of the statement.

See alsoSection 3.5.6.3, “VARBINARY, Section 3.5.6.2, “CHAR

3.5.6.5NCHAR

Data Type Declaration Format

  |{NCHAR | NATIONAL {CHAR | CHARACTER}} [(length)]

NCHAR is a fixed-length character data type with the ISO8859_1 character set. In all other respects it is the same as CHAR.

If no length is specified, it is taken to be 1.

A similar data type is available for the variable-length string type: NATIONAL {CHAR | CHARACTER} VARYING.

See alsoSection 3.5.6.2, “CHAR, Section 3.5.6.4, “VARCHAR