#
# $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
#

CHARACTER DATA
==============

This package generates some data files that contain character properties useful
for text processing.

CHARACTER PROPERTIES
====================

The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.

The following is a property name and code table to be used with the character
data:

NAME CODE DESCRIPTION
---------------------
Mn   0    Mark, Non-Spacing
Mc   1    Mark, Spacing Combining
Me   2    Mark, Enclosing
Nd   3    Number, Decimal Digit
Nl   4    Number, Letter
No   5    Number, Other
Zs   6    Separator, Space
Zl   7    Separator, Line
Zp   8    Separator, Paragraph
Cc   9    Other, Control
Cf   10   Other, Format
Cs   11   Other, Surrogate
Co   12   Other, Private Use
Cn   13   Other, Not Assigned
Lu   14   Letter, Uppercase
Ll   15   Letter, Lowercase
Lt   16   Letter, Titlecase
Lm   17   Letter, Modifier
Lo   18   Letter, Other
Pc   19   Punctuation, Connector
Pd   20   Punctuation, Dash
Ps   21   Punctuation, Open
Pe   22   Punctuation, Close
Po   23   Punctuation, Other
Sm   24   Symbol, Math
Sc   25   Symbol, Currency
Sk   26   Symbol, Modifier
So   27   Symbol, Other
L    28   Left-To-Right
R    29   Right-To-Left
EN   30   European Number
ES   31   European Number Separator
ET   32   European Number Terminator
AN   33   Arabic Number
CS   34   Common Number Separator
B    35   Block Separator
S    36   Segment Separator
WS   37   Whitespace
ON   38   Other Neutrals
Pi   47   Punctuation, Initial
Pf   48   Punctuation, Final
#
# Implementation specific properties.
#
Cm   39   Composite
Nb   40   Non-Breaking
Sy   41   Symmetric (characters which are part of open/close pairs)
Hd   42   Hex Digit
Qm   43   Quote Mark
Mr   44   Mirroring
Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
Cp   46   Defined character

The actual binary data is formatted as follows:

  Assumptions: unsigned short is at least 16 bits in size and unsigned long
               is at least 32 bits in size.

  unsigned short ByteOrderMark
  unsigned short OffsetArraySize
  unsigned long  Bytes
  unsigned short Offsets[OffsetArraySize + 1]
  unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]

The Bytes field provides the total byte count used for the Offsets[] and
Ranges[] arrays. The Offsets[] array is aligned on a 4-byte boundary and
there is always one extra node on the end to hold the final index of the
Ranges[] array. The Ranges[] array contains pairs of 4-byte values
representing a range of Unicode characters. The pairs are arranged in
increasing order by the first character code in the range.

Determining whether a particular character has a property requires a simple
binary search over the ranges stored for that property.

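The search over the ranges of one property can be sketched in C as follows.
This is an illustration only, not the package's own code. The name
prop_lookup and its parameters are hypothetical. The sketch assumes that the
ranges for property code p occupy Ranges[Offsets[p]] up to, but not
including, Ranges[Offsets[p + 1]], that the offsets count unsigned longs (so
they are always even), and that both ends of a range are inclusive. None of
these details is spelled out above.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if `code` falls inside one of the ranges
 * stored for property `prop`, 0 otherwise.
 *
 * Assumptions (not stated in the format description): the ranges for
 * property p occupy ranges[offsets[p]] .. ranges[offsets[p + 1] - 1], and
 * both ends of each (start, end) pair are inclusive.
 */
static int
prop_lookup(unsigned long code, unsigned short prop,
            const unsigned short *offsets, unsigned short offset_array_size,
            const unsigned long *ranges)
{
    size_t lo, hi, mid;

    if (prop >= offset_array_size)
        return 0;

    /* Work in units of pairs: pair i is ranges[2*i], ranges[2*i + 1]. */
    lo = offsets[prop] / 2;
    hi = offsets[prop + 1] / 2;       /* one past the last pair */

    while (lo < hi) {
        mid = lo + (hi - lo) / 2;
        if (code < ranges[2 * mid])
            hi = mid;                 /* code is below this range */
        else if (code > ranges[2 * mid + 1])
            lo = mid + 1;             /* code is above this range */
        else
            return 1;                 /* start <= code <= end */
    }
    return 0;
}
```

A caller would pass a property code from the table above, for example 3 to
test for Nd (Number, Decimal Digit).
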
If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
machine with a different endian order and the values must be byte-swapped.

To swap a 16-bit value:

  c = (c >> 8) | ((c & 0xff) << 8)

To swap a 32-bit value:

  c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
      (((c >> 16) & 0xff) << 8) | (c >> 24)

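The two swaps can be wrapped in small C functions, sketched below. The names
swap16, swap32 and needs_swap are hypothetical. The extra masks are an
addition to the expressions above: they keep the results correct where
unsigned short is wider than 16 bits or unsigned long is wider than 32 bits,
which the stated assumptions allow.

```c
/*
 * Hypothetical helpers implementing the two swaps.  The masks on the
 * shifted-down bytes discard any bits above the 16-bit or 32-bit value.
 */
static unsigned short
swap16(unsigned short c)
{
    return (unsigned short) (((c >> 8) & 0xff) | ((c & 0xff) << 8));
}

static unsigned long
swap32(unsigned long c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | ((c >> 24) & 0xff);
}

/* Returns 1 if a header's ByteOrderMark says the file must be swapped. */
static int
needs_swap(unsigned short byte_order_mark)
{
    return byte_order_mark == 0xfffe;
}
```

When needs_swap() reports 1, every 16-bit field would go through swap16()
and every 32-bit field through swap32() before use.
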
CASE MAPPINGS
=============

The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case. Each table is in
increasing order by character code and each mapping contains 3 unsigned longs
which represent the possible mappings.

The format for the binary form of these tables is:

  unsigned short ByteOrderMark
  unsigned short NumMappingNodes, count of all mapping nodes
  unsigned short CaseTableSizes[2], upper and lower mapping node counts
  unsigned long  CaseTables[NumMappingNodes * 3], 3 values per mapping node

The starting indexes of the case tables are calculated as follows:

  UpperIndex = 0;
  LowerIndex = CaseTableSizes[0] * 3;
  TitleIndex = LowerIndex + CaseTableSizes[1] * 3;

The order of the fields for the three tables is:

  Upper case
  ----------
  unsigned long upper;
  unsigned long lower;
  unsigned long title;

  Lower case
  ----------
  unsigned long lower;
  unsigned long upper;
  unsigned long title;

  Title case
  ----------
  unsigned long title;
  unsigned long upper;
  unsigned long lower;

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

Because the tables are in increasing order by character code, locating a
mapping requires a simple binary search on the first of the 3 codes that make
up each node, which is the code the table is sorted by.

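A minimal C sketch of such a search follows. The names case_lookup and
to_lower are hypothetical, and the sketch assumes that the first code of
each node is the search key, as the field order of the three tables
suggests. It maps an upper case character to lower case by searching the
upper case table, whose nodes are laid out as (upper, lower, title).

```c
#include <stddef.h>

/*
 * Hypothetical helper: binary search over one case table.
 *
 * `table` points at the first node of the table (for example
 * CaseTables + LowerIndex) and `nodes` is its node count.  Each node is
 * 3 unsigned longs and the first one is assumed to be the sort key.
 * Returns a pointer to the matching node, or NULL if `code` has no
 * mapping in this table.
 */
static const unsigned long *
case_lookup(unsigned long code, const unsigned long *table, size_t nodes)
{
    size_t lo = 0, hi = nodes, mid;

    while (lo < hi) {
        mid = lo + (hi - lo) / 2;
        if (code < table[mid * 3])
            hi = mid;
        else if (code > table[mid * 3])
            lo = mid + 1;
        else
            return &table[mid * 3];
    }
    return NULL;
}

/*
 * Hypothetical helper: map `code` to lower case using the upper case
 * table.  Characters without a mapping are returned unchanged.
 */
static unsigned long
to_lower(unsigned long code, const unsigned long *case_tables,
         const unsigned short case_table_sizes[2])
{
    /* UpperIndex is 0, so the upper case table starts at case_tables. */
    const unsigned long *node =
        case_lookup(code, case_tables, case_table_sizes[0]);

    return node ? node[1] : code;
}
```

The lower and title case tables can be searched the same way by passing
CaseTables + LowerIndex or CaseTables + TitleIndex and the matching node
count.
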
Note that NumMappingNodes is normally a 16-bit value, so at most 65535
mapping nodes can be stored. Divided evenly, that allows 21845 nodes for each
of the 3 case mapping tables. An individual table may hold more or fewer than
21845 nodes, as long as the total does not exceed 65535.

DECOMPOSITIONS
==============

The next data file is called "decomp.dat" and contains the decomposition data
for all characters whose decompositions contain more than one character and
are *not* compatibility decompositions. Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field. Each list of character codes represents a full
decomposition of a composite character. The nodes are arranged in increasing
order by character code.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumDecompNodes, count of all decomposition nodes
  unsigned long  Bytes
  unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
  unsigned long  Decomp[N], N = sum of the lengths of all decomposition lists

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The DecompNodes[] array consists of pairs of unsigned longs, the first of
which is the character code and the second is the initial index of the list
of character codes representing the decomposition.

Locating the decomposition of a composite character requires a binary search
for a character code in the DecompNodes[] array and using its index to
locate the start of the decomposition. The length of the decomposition list
is the index in the following element in DecompNodes[] minus the current
index.

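The lookup can be sketched in C as shown below. The name decomp_lookup and
its parameters are hypothetical. The sketch assumes that the one extra
element at the end of DecompNodes[] holds the end index of the last
decomposition list, so that the length of the last list can be computed the
same way as the others.

```c
#include <stddef.h>

/*
 * Hypothetical helper: find the decomposition of `code`.
 *
 * Assumed layout: node i is the pair decomp_nodes[2*i] (character code)
 * and decomp_nodes[2*i + 1] (start index into decomp[]), and the single
 * extra element at decomp_nodes[num_nodes * 2] holds the end index of
 * the last list.
 *
 * On success stores the list length in *count and returns a pointer to
 * the first code of the list; returns NULL if there is no decomposition.
 */
static const unsigned long *
decomp_lookup(unsigned long code, const unsigned long *decomp_nodes,
              size_t num_nodes, const unsigned long *decomp, size_t *count)
{
    size_t lo = 0, hi = num_nodes, mid;
    unsigned long start, end;

    while (lo < hi) {
        mid = lo + (hi - lo) / 2;
        if (code < decomp_nodes[mid * 2]) {
            hi = mid;
        } else if (code > decomp_nodes[mid * 2]) {
            lo = mid + 1;
        } else {
            start = decomp_nodes[mid * 2 + 1];
            /* The last node takes its end index from the extra element. */
            end = (mid + 1 < num_nodes) ? decomp_nodes[mid * 2 + 3]
                                        : decomp_nodes[num_nodes * 2];
            *count = (size_t) (end - start);
            return decomp + start;
        }
    }
    return NULL;
}
```

A caller would pass NumDecompNodes as num_nodes and read *count codes
starting at the returned pointer.
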
COMBINING CLASSES
=================

The fourth data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumCCLNodes
  unsigned long  Bytes
  unsigned long  CCLNodes[NumCCLNodes * 3]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The CCLNodes[] array consists of groups of three unsigned longs. The first
and second are the beginning and ending of a range and the third is the
combining class of that range.

If a character is not found in this table, then the combining class is
assumed to be 0.

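A combining class lookup over this table might look like the C sketch below.
The name combining_class is hypothetical. The sketch assumes that the nodes
are sorted by the start of their range, that ranges do not overlap, and that
the end of each range is inclusive; the description above does not state
this, so a linear scan would be the safe fallback if it does not hold.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the combining class of `code`.
 *
 * Assumptions: node i is the triple ccl_nodes[3*i] (range start),
 * ccl_nodes[3*i + 1] (range end, inclusive) and ccl_nodes[3*i + 2]
 * (combining class), and the nodes are sorted by range start with no
 * overlapping ranges.
 */
static unsigned long
combining_class(unsigned long code, const unsigned long *ccl_nodes,
                size_t num_nodes)
{
    size_t lo = 0, hi = num_nodes, mid;

    while (lo < hi) {
        mid = lo + (hi - lo) / 2;
        if (code < ccl_nodes[mid * 3])
            hi = mid;
        else if (code > ccl_nodes[mid * 3 + 1])
            lo = mid + 1;
        else
            return ccl_nodes[mid * 3 + 2];
    }
    return 0;   /* not in the table: combining class 0 */
}
```
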
Note that at most 65535 distinct nodes, each a range plus its combining
class, can be specified because NumCCLNodes is usually a 16-bit number.

NUMBER TABLE
============

The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.

The format for the binary form of the table is:

  unsigned short ByteOrderMark
  unsigned short NumNumberNodes
  unsigned long  Bytes
  unsigned long  NumberNodes[NumNumberNodes]
  unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                            / sizeof(short)]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The NumberNodes array contains pairs of values, the first of which is the
character code and the second an index into the ValueNodes array. The
ValueNodes array contains pairs of integers which represent the numerator
and denominator of the numeric value of the character. If the character
happens to map to an integer, both values of its pair in ValueNodes will be
the same.
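
A numeric value lookup can be sketched in C as follows. The names
number_value and number_lookup are hypothetical. The sketch assumes that
NumNumberNodes counts unsigned longs (two per character), that the pairs are
sorted by character code, and that each index gives the position of the
numerator in ValueNodes with the denominator in the next element. These
points are a reading of the layout above, not something it states outright.

```c
#include <stddef.h>

/* Hypothetical result type: the value is numerator / denominator. */
struct number_value {
    unsigned short numerator;
    unsigned short denominator;
};

/*
 * Hypothetical helper: look up the numeric value of `code`.
 *
 * Assumptions: num_number_nodes counts unsigned longs, so there are
 * num_number_nodes / 2 (code, index) pairs sorted by character code, and
 * each index is the position of the numerator in value_nodes[], with the
 * denominator stored in the next element.
 *
 * Returns 1 and fills *out on success, 0 if `code` has no numeric value.
 */
static int
number_lookup(unsigned long code, const unsigned long *number_nodes,
              size_t num_number_nodes, const unsigned short *value_nodes,
              struct number_value *out)
{
    size_t lo = 0, hi = num_number_nodes / 2, mid;

    while (lo < hi) {
        mid = lo + (hi - lo) / 2;
        if (code < number_nodes[mid * 2]) {
            hi = mid;
        } else if (code > number_nodes[mid * 2]) {
            lo = mid + 1;
        } else {
            const unsigned short *vp =
                value_nodes + number_nodes[mid * 2 + 1];

            out->numerator = vp[0];
            out->denominator = vp[1];
            return 1;
        }
    }
    return 0;
}
```

Following the rule above, a result whose numerator and denominator are equal
stands for that integer rather than for a fraction.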