gecko-dev/intl/docs/icu.rst

289 lines
22 KiB
ReStructuredText
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

###
ICU
###
Introduction
============
Internationalization (i18n, “i” then 18 letters then “n”) is the process of handling data with respect to a particular locale:
- The number 5 representing five US dollars might be formatted as
- “$5.00” in American English,
- “US$5.00” in Canadian English, or
- “5,00 $US” in French.
- A list of peoples names in a phone book would sort
- in English alphabetically; but
- in German, where “ä”/“ö”/“ü” are often interchangeable with “ae”/“oe”/“ue”, alphabetically but with vowels with umlauts treated as their two-vowel counterparts.
- The currency whose code is “CHF” might be formatted as
- “Swiss Franc” in English, but
- “franc suisse” in French.
- The Unix time 1590803313070 might format as the time string
- “9:48:33 PM Eastern Daylight Time” in American English, but
- “21:48:33 Nordamerikanische Ostküsten-Sommerzeit” in German.
i18n encompasses far more than this, but you get the basic idea.
Internationalization in SpiderMonkey and Gecko
==============================================
SpiderMonkey implements extensive i18n capabilities through the `ECMAScript Internationalization API <https://tc39.es/ecma402/>`__ and the global ``Intl`` object. Gecko requires i18n capabilities to implement text shaping, sort operations in some contexts, and various other features.
SpiderMonkey and Gecko use `ICU <http://site.icu-project.org/>`__, Internationalization Components for Unicode, to implement many low-level i18n operations. (Line breaking, implemented instead in ``intl/lwbrk``, is a notable exception.) Gecko and SpiderMonkey also use ICUs implementations of certain i18n-*adjacent* operations (for example, Unicode normalization).
ICU date/time formatting functionality requires extensive knowledge of time zone names and when zone transitions occur. The IANA ``tzdata`` database supplies this information.
A final note of caution: ICU carefully depends upon an exact Unicode version. Other parts of SpiderMonkey and Gecko have separate dependencies on an exact Unicode version. Updates to ICU and related components *must* be synchronized with those updates so that the entirety of SpiderMonkey, and the entirety of Gecko including SpiderMonkey within it, advance to new Unicode versions in lockstep. [#lockstep]_
.. [#lockstep]
The steps involved in updating Gecko-in-generals Unicode version, and updating SpiderMonkeys code dependent on Unicode version, are `documented on WikiMO <https://wiki.mozilla.org/I18n:Updating_Unicode_version>`__.
Building SpiderMonkey or Gecko with ICU
=======================================
SpiderMonkey and Gecko can be built using either a periodically-updated copy of ICU in ``intl/icu/source`` (using time zone data in ``intl/tzdata/source``), or using a system-provided ICU library (dependent on its own ``tzdata`` information). Pass ``--with-system-icu`` when configuring to use system ICU. (Using system ICU will disable some ``Intl`` functionality, such as historically accurate time zone calculations, that cant be readily supported without a precisely-controlled ICU.) ICU version requirements advance fairly quickly as Gecko depends on features and bug fixes in newer ICU releases. Youll get a build error if you try to use an unsupported ICU.
SpiderMonkeys ``Intl`` API may be built or disabled by configuring ``--with-intl-api`` (the default) or ``--without-intl-api``. SpiderMonkey built without the ``Intl`` API doesnt require ICU. However, if you build without the ``Intl`` API, some non-``Intl`` JavaScript functionality will not exist (``String.prototype.normalize``) or wont fully work (for example, ``String.prototype.toLocale{Lower,Upper}Case`` will not respect a provided locale, and the various ``toLocaleString`` functions have best-effort behavior).
Using ICU functionality in SpiderMonkey and Gecko
=================================================
ICU headers are considered system headers by the Gecko build system, so they must be listed in ``config/system-headers.mozbuild``. Code that wishes to use ICU functionality may use ``#include "unicode/unorm.h"`` or similar to do so.
Gecko and SpiderMonkey code may use ICUs stable C API (ICU4C). These functions are stable and shouldnt change as ICU updates occur. (ICU4Cs ``enum`` initializers are not always stable: while initializer values are stable, new initializers are sometimes added, perhaps behind ``#ifdef U_HIDE_DRAFT_API``. This may be necessary for exhaustive ``switch``\ es to add ``#ifdef``\ s around some ``case``\ s.)
Gecko and SpiderMonkey are strongly discouraged from using ICUs C++ API (unfortunately including all smart pointer classes), because the C++ API doesnt provide ICU4Cs compatibility guarantees. Rarely, we tolerate C++ API use when no stable option exists. But the API has to “look” reasonably stable, and we usually want to start a discussion with upstream about adding a stable API to eventually use. Use symbols from ``namespace icu`` to access ICU C++ functionality. *Talk to the current imported-ICU owner (presently Jeff Walden) before you start doing any of this!*
SpiderMonkey and Geckos imported ICU
=====================================
Build system
------------
The system for building ICU lives in ``config/external/icu`` and ``intl/icu/icu_sources_data.py``. We generate a Mozilla-compatible build system rather than using ICUs build system. The build system is shared by SpiderMonkey and Gecko both.
ICU includes functionality we never use, so we dont naively compile all of it. We extract the list of files to compile from ``intl/icu/source/{common,i18n}/Makefile.in`` and then apply a manually-maintained list of unused files (stored in ``intl/icu_sources_data.py``) when we update ICU.
Locale and time zone data
-------------------------
ICU contains a considerable amount of raw locale data: formatting characteristics for each locale, strings for things like currencies and languages for each locale, localized time zone specifiers, and so on. This data lives in human-readable files in ``intl/icu/source/data``. Time zone data in ``intl/tzdata/source`` is stored in partially-compiled formats (some of them only partly human-readable).
However, a normal Gecko build never uses these files! Instead, both ICU and ``tzdata`` data are precompiled into a large, endian-specific ``icudtNNE.dat`` (``NN`` = ICU version, ``E`` = endianness) file. [#why-icudt-not-rebuilt-every-time]_ That file is added to ``config/external/icu/data/`` and is checked into the Mozilla tree, to be directly incorporated into Gecko/SpiderMonkey builds. For size reasons, only the little-endian version is checked into the tree. It is converted into a big-endian version when necessary during the build.
ICUs locale data covers *all* ICU internationalization features, including ones we never need. We trim locale data to size with a ``intl/icu/data_filter.json`` `data filter <https://github.com/unicode-org/icu/blob/master/docs/userguide/icu_data/buildtool.md>`__ when compiling ``icudtNNE.dat``. Removing *too much* data wont *necessarily* break the build, so its important that we have automated tests for the locale data we actually use in order to detect mistakes.
.. [#why-icudt-not-rebuilt-every-time]
``icudtNNE.dat`` isnt compiled during a SpiderMonkey/Gecko build because it would require ICU command-line tools. And its a pain to either compile and run them during the build, or to require them as build dependencies.
Local patching of ICU
---------------------
We generally dont patch our copy of ICU except for compelling need. When we do patch, we usually only apply reasonably small patches that have been reviewed and landed upstream (so that our patch will be obsolete when we next update ICU).
Local patches are stored in the ``intl/icu-patches`` directory. Theyre applied when ICU is updated, so merely updating ICU files in place wont persist changes across an ICU update.
Updating imported code
----------------------
The process of updating imported i18n-relevant code is *semi*-automated. We use a series of shell and Python scripts to do the job.
Updating ICU
~~~~~~~~~~~~
New ICU versions are announced on the `icu-announce <https://lists.sourceforge.net/lists/listinfo/icu-announce>`__ mailing list. Both release candidates and actual releases are announced here. Its a good idea to attempt to update ICU when a release candidate is announced, just in case some serious problem is present (especially one that would be painful to fix through local patching).
``intl/update-icu.sh`` updates our ICU to a given ICU release: [#icu-git-argument]_
.. code:: bash
$ cd "$topsrcdir/intl"
$ # Ensure certain Python modules in the tree are accessible when updating.
$ export PYTHONPATH="$topsrcdir/python/mozbuild/"
$ # <URL to ICU Git> <release tag name>
$ ./update-icu.sh https://github.com/unicode-org/icu.git release-67-1
.. [#icu-git-argument]
The ICU Git URL argument lets you update from a local ICU clone. This can speed up work when youre updating to a new ICU release and need to adjust or add new local patches.
But usually youll want to update to the latest commit from the corresponding ICU maintenance branch so that you pick up fixes landed post-release:
.. code:: bash
$ cd "$topsrcdir/intl"
$ # Ensure certain Python modules in the tree are accessible when updating.
$ export PYTHONPATH="$topsrcdir/python/mozbuild/"
$ # <URL to ICU Git> <maintenance name>
$ ./update-icu.sh https://github.com/unicode-org/icu.git maint/maint-67
Updating ICU will also update the language tag registry (which records language tag semantics needed to correctly implement ``Intl`` functionality). Therefore its likely necessary to update SpiderMonkeys language tag handling after running this [#update-icu-warning-langtags]_. See below where the ``langtags`` mode of ``make_intl_data.py`` is discussed.
.. [#update-icu-warning-langtags]
``update-icu.sh`` will print a notice as a reminder of this:
.. code:: bash
INFO: Please run 'js/src/builtin/intl/make_intl_data.py langtags' to update additional language tag files for SpiderMonkey.
``update-icu.sh`` is intended for *replayability*, not for hands-off runnability. It downloads ICU source, prunes various irrelevant files, replaces ``intl/icu/source`` with the new files and then blindly applies local patches in fixed order.
Often a local patch wont apply, or new patches must be applied to successfully build. In this case youll have to manually edit ``update-icu.sh`` to abort after only *some* patches have been applied, make whatever changes are necessary by hand, generate a new/updated patch file by hand, then carefully reattempt updating. (The people who have updated ICU in the past, usually jwalden and anba, follow this awkward process and dont have good ideas on how to improve it.)
Any time ICU is updated, youll need to fully rebuild whichever of SpiderMonkey or Gecko youre building. For SpiderMonkey, delete your object directory and reconfigure from scratch. For Gecko, change the message in the top-level `CLOBBER <https://searchfox.org/mozilla-central/source/CLOBBER>`__ file.
Updating tzdata
~~~~~~~~~~~~~~~
ICU contains a copy of ``tzdata``, but that copy is whatever ``tzdata`` release was current at the time the ICU release was finalized. Time zone data changes much more often than that: every time some national legislature or tinpot dictator decides to alter time zones. [#tzdata-release-frequency]_ The `tz-announce <https://mm.icann.org/pipermail/tz-announce/>`__ mailing list announces changes as they occur. (Note that we cant *immediately* update when a release occurs: ICUs `icu-data <https://github.com/unicode-org/icu-data>`__ repository must be updated before we can update our ``tzdata``.)
.. [#tzdata-release-frequency]
To give a sense of how frequently ``tzdata`` is updated, and the irregularity of releases over time:
- 2019 had three ``tzdata`` releases, 2019a through 2019c.
- 2018 had nine ``tzdata`` releases, 2018a through 2018i.
- 2017 had three ``tzdata`` releases, 2017a through 2017c.
Therefore, either (usually) after you update ICU *or* when a new ``tzdata`` release occurs, youll need to update our imported ``tzdata`` files. (If you do need to update time zone data, note that youll also need to additionally update SpiderMonkeys time zone handling, described further below.) This also suitably updates ``config/external/icu/data/icudtNNE.dat``. (If youve just run ``update-icu.sh``, it will warn you that you need to do this. [#update-icu-warning-old-tzdata]_)
.. [#update-icu-warning-old-tzdata]
For example:
::
WARN: Local tzdata (2020a) is newer than ICU tzdata (2019c), please run './update-tzdata.sh 2020a'
First, make sure you have a usable ``icupkg`` on your system. [#icupkg-on-system]_ Then run the ``update-tzdata.sh`` script to update ``intl/tzdata`` and ``icudtNNE.dat``:
.. code:: bash
$ cd "$topsrcdir/intl"
$ ./update-tzdata.sh 2020a # or whatever the latest release is
.. [#icupkg-on-system]
To install ``icupkg`` on your system:
- On Fedora, use ``sudo dnf install icu``.
- On Ubuntu, use ``sudo apt-get install icu-devtools``.
- On Mac OS X, use ``brew install icu4c``.
- On Windows, youll need to `download a binary build of ICU for Windows <https://github.com/unicode-org/icu/releases/tag/release-67-1>`__ and use the ``bin/icupkg.exe`` or ``bin64/icupkg.exe`` utility inside it.
If youre on Windows, or for some reason you dont want to use the ``icupkg`` now in your ``$PATH``, you can manually specify it on the command line using the ``-e /path/to/icupkg`` flag:
.. code:: bash
$ cd "$topsrcdir/intl"
$ ./update-tzdata.sh -e /path/to/icupkg 2020a # or whatever the latest release is
*In principle*, the ``icupkg`` you use *should* be the one from the ICU release/maintenance branch being built: if theres a mismatch, you might encounter an ICU “format version not supported” error. If youre on Windows, make sure to download a binary build for that release/branch. On other platforms, you might have to build your own ICU from source. The steps required to do this are left as an exercise for the reader. (In the somewhat longer term, the update commands might be changed to do this themselves.)
If ``tzdata`` must be updated on trunk, youll almost certainly have to backport the update to Beta and ESR. Dont attempt to backport the literal patch; just run the appropriate commands documented here to do so.
Updating SpiderMonkey ``Intl`` data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SpiderMonkey itself cant blindly invoke ICU to perform every i18n operation, because sometimes ICU behavior deviates from what web specifications require. Therefore, when ICU is updated, we also must update SpiderMonkey itself as well (including various generated tests). Such updating is performed using the various modes of ``js/src/builtin/make_intl_data.py``.
Updating SpiderMonkey time zone handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ECMAScript Internationalization API requires that time zone identifiers (``America/New_York``, ``Antarctica/McMurdo``, etc.) be interpreted according to `IANA <https://www.iana.org/time-zones>`__ semantics. Unfortunately, ICU doesnt precisely implement those semantics. (See comments in ``js/src/builtin/intl/SharedIntlData.h`` for details.) Therefore SpiderMonkey has to do certain pre- and post-processing based on whats in IANA but not in ICU, and whats in ICU that isnt in IANA.
Use ``make_intl_data.py``\ s ``tzdata`` mode to update time zone information:
.. code:: bash
$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py tzdata
The ``tzdata`` mode accepts two optional arguments that generally will not be needed:
- **``--tz``** will act using data from a local ``tzdata/`` directory containing raw ``tzdata`` source (note that this is *not* the same as what is in ``intl/tzdata/source``). It may be useful to help debug problems that arise during an update.
- **``--ignore-backzone``** will omit time zone information before 1970. SpiderMonkey and Gecko include this information by default. However, because (by deliberate policy) ``tzdata`` information before 1970 is not reliable to the same degree as data since 1970, and backzone data has a size cost, a SpiderMonkey embedding or custom Gecko build might decide to omit it.
Updating SpiderMonkey language tag handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Language tags (``en``, ``de-CH``, ``ar-u-ca-islamicc``, and so on) are the primary means of specifying localization characteristics. The ECMAScript Internationalization API supports certain operations that depend upon the current state of the language tag registry (stored in the Unicode Common Locale Data Repository, CLDR, a repository of all locale-specific characteristics) that specifies subtag semantics:
- ``Intl.getCanonicalLocales`` and ``Intl.Locale`` must replace alias subtags with their preferred forms. For example, ``ar-u-ca-islamic-civil`` uses the preferred Islamic calendar subtag, while ``ar-u-ca-islamicc`` uses an alias.
- ``Intl.Locale.prototype.maximize`` and ``Intl.Locale.prototype.minimize`` accept a language tag and add or remove “likely” subtags from it. For example, ``de`` most likely refers to German using Latin script in Germany, so it maximizes to ``de-Latn-DE`` and in reverse, ``de-Latn-DE`` minimizes to simply ``de``.
These decisions vary over time: as countries change [#soviet-union]_, as customs change, as language prevalence in regions varies, etc.
.. [#soviet-union]
For just one relevant example, the breakup of the Soviet Union is the cause of numerous entries in the language tag registry. ``ru-SU``, Russian as used in the Soviet Union, is now expressed as ``ru-RU``, Russian as used in Russia; ``ab-SU``, Abkhazian as used in the Soviet Union, is now expressed as ``ab-GE``, Abkhazian as used in Georgia; and so on for all the other satellite states.
Use ``make_intl_data.py``\ s ``langtags`` mode to update language tag information to the same CLDR version used by ICU:
.. code:: bash
$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py langtags
The CLDR version used will be printed in the header of CLDR-sensitive generated files. For example, ``js/src/builtin/intl/LanguageTagGenerated.cpp`` currently begins with:
.. code:: cpp
// Generated by make_intl_data.py. DO NOT EDIT.
// Version: CLDR-37
// URL: https://unicode.org/Public/cldr/37/core.zip
Updating SpiderMonkey currency support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Currencies use different numbers of fractional digits in their preferred formatting. Most currencies use two decimal digits; a handful use no fractional digits or some other number. Currency fractional digit is maintained by ISO and must be updated as currencies change their preferred fractional digits or new currencies arise that dont use two decimal digits.
Currency updates are fairly uncommon, so itll be rare to need to update currency info. A `newsletter <https://www.currency-iso.org/en/home/amendments/newsletter.html>`__ periodically sends updates about changes.
Use ``make_intl_data.py``\ s ``currency`` mode to update currency fractional digit information:
.. code:: bash
$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py currency
Updating SpiderMonkey measurement formatting support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``Intl`` API supports formatting numbers as measurement units (for example, “17 meters” or “42 meters per second”). It specifies a list of units that must be supported, that we centrally record in ``js/src/builtin/intl/SanctionedSimpleUnitIdentifiers.yaml``, that we verify are supported by ICU and generate supporting files from.
If ``Intl``\ s list of supported units is ever updated, two separate changes will be required.
First, ``intl/icu/data_filter.json`` must be updated to incorporate localized strings for the new unit. These strings are stored in ``icudtNNE.dat``, so youll have to re-update ICU (and likely reimport ``tzdata`` as well, if its been updated since the last ICU update) to rewrite that file.
Second, use ``make_intl_data.py``\ s ``units`` mode to update unit handling and associated tests in SpiderMonkey:
.. code:: bash
$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py units
Updating SpiderMonkey numbering systems support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``Intl`` API also supports formatting numbers in various numbering systems (for example, “123“ using Latin numbers or “一二三“ using Han decimal numbers). The list of numbering systems that we must support is stored in ``js/src/builtin/intl/NumberingSystems.yaml``. We verify these numbering systems are supported by ICU and generate supporting files from it.
When the list of supported numbering systems needs to be updated, run ``make_intl_data.py`` with the ``numbering`` mode to update it and associated tests in SpiderMonkey:
.. code:: bash
$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py numbering