mirror of
https://gitee.com/openharmony/third_party_rust_regex
synced 2025-04-14 16:39:56 +00:00

This commit refactors the way this library handles Unicode data by making it completely optional. Several features are introduced which permit callers to select only the Unicode data they need (up to a point of granularity). An important property of these changes is that presence of absence of crate features will never change the match semantics of a regular expression. Instead, the presence or absence of a crate feature can only add or subtract from the set of all possible valid regular expressions. So for example, if the `unicode-case` feature is disabled, then attempting to produce `Hir` for the regex `(?i)a` will fail. Instead, callers must use `(?i-u)a` (or enable the `unicode-case` feature). This partially addresses #583 since it permits callers to decrease binary size.
64 lines
2.3 KiB
Bash
Executable File
64 lines
2.3 KiB
Bash
Executable File
#!/bin/sh
|
|
|
|
# This script is responsible for generating some of the Unicode tables used
|
|
# in regex-syntax.
|
|
#
|
|
# Usage is simple, first download the Unicode data:
|
|
#
|
|
# $ mkdir ucd
|
|
# $ cd ucd
|
|
# $ curl -LO https://www.unicode.org/Public/zipped/12.1.0/UCD.zip
|
|
# $ unzip UCD.zip
|
|
# $ curl -LO https://unicode.org/Public/emoji/12.0/emoji-data.txt
|
|
#
|
|
# And then run this script from the root of this repository by pointing it at
|
|
# the data directory downloaded above:
|
|
#
|
|
# $ ./scripts/generate-unicode-tables path/to/ucd
|
|
|
|
if [ $# != 1 ]; then
|
|
echo "Usage: $(basename "$0") <ucd-data-directory>" >&2
|
|
exit 1
|
|
fi
|
|
ucddir="$1"
|
|
|
|
out="regex-syntax/src/unicode_tables"
|
|
ucd-generate age "$ucddir" \
|
|
--chars > "$out/age.rs"
|
|
ucd-generate case-folding-simple "$ucddir" \
|
|
--chars --all-pairs > "$out/case_folding_simple.rs"
|
|
ucd-generate general-category "$ucddir" \
|
|
--chars --exclude surrogate > "$out/general_category.rs"
|
|
ucd-generate grapheme-cluster-break "$ucddir" \
|
|
--chars > "$out/grapheme_cluster_break.rs"
|
|
ucd-generate property-bool "$ucddir" \
|
|
--chars > "$out/property_bool.rs"
|
|
ucd-generate property-names "$ucddir" \
|
|
> "$out/property_names.rs"
|
|
ucd-generate property-values "$ucddir" \
|
|
--include gc,script,scx,age,gcb,wb,sb > "$out/property_values.rs"
|
|
ucd-generate script "$ucddir" \
|
|
--chars > "$out/script.rs"
|
|
ucd-generate script-extension "$ucddir" \
|
|
--chars > "$out/script_extension.rs"
|
|
ucd-generate sentence-break "$ucddir" \
|
|
--chars > "$out/sentence_break.rs"
|
|
ucd-generate word-break "$ucddir" \
|
|
--chars > "$out/word_break.rs"
|
|
|
|
# These generate the \w, \d and \s Unicode-aware character classes. \d and \s
|
|
# are technically part of the general category and boolean properties generated
|
|
# above. However, these are generated separately to make it possible to enable
|
|
# or disable them via Cargo features independently of whether all boolean
|
|
# properties or general categories are enabled or disabled. The crate ensures
|
|
# that only one copy is compiled.
|
|
ucd-generate perl-word "$ucddir" \
|
|
--chars > "$out/perl_word.rs"
|
|
ucd-generate general-category "$ucddir" \
|
|
--chars --include decimalnumber > "$out/perl_decimal.rs"
|
|
ucd-generate property-bool "$ucddir" \
|
|
--chars --include whitespace > "$out/perl_space.rs"
|
|
|
|
# Make sure everything is formatted.
|
|
cargo +stable fmt --all
|