For small arrays of data, slices are expensive:
- there's the obvious size of the array (`size_of::<char>() * length`)
- there's the size of the slice itself (`size_of::<(*const char, usize)>()`)
- there's the size of the relocation for the pointer in the slice. The
worst case is on 64-bit ELF, where it is `3 * size_of::<usize>()`(!).
Most entries in decomposition tables are 2 characters or less, so the
overhead for each of these tables is incredibly large.
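As a rough sketch of the arithmetic above (the constant and function names are made up for illustration, and the relocation cost assumes the 64-bit ELF worst case mentioned in the list):

```rust
use std::mem::size_of;

/// Bytes taken by the slice itself: a data pointer plus a length.
pub const SLICE_HEADER: usize = size_of::<&'static [char]>();

/// Assumed relocation cost per pointer: three machine words,
/// i.e. the worst case quoted above, not a measured value.
pub const RELA_ENTRY: usize = 3 * size_of::<usize>();

/// Bytes of actual character data for an entry of `len` chars.
pub fn payload(len: usize) -> usize {
    len * size_of::<char>()
}

/// Total bytes paid for one table entry stored as a `&'static [char]`.
pub fn slice_entry_cost(len: usize) -> usize {
    payload(len) + SLICE_HEADER + RELA_ENTRY
}
```

On a 64-bit target, `slice_entry_cost(2)` comes to 48 bytes for 8 bytes of actual characters.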
To give an idea, a "Hello, world!" executable (fresh from `cargo new`)
built with `--release` on my machine has 17712 bytes of
.rela.dyn (relocations) and 9520 bytes of .data.rel.ro (relocatable
read-only data).
Adding unicode-normalization as a dependency and changing the code to
`println!("{}", String::from_iter("Hello, world!".nfc()));`,
those jump to, respectively, 156336 and 147968 bytes.
For comparison, with unicode-normalization 0.1.8 (last release before
the perfect hashes), they were 18168 and 9872 bytes. In exchange, 0.1.8
had a larger .text (code) (314607 bytes with 0.1.8 vs. 234639 with
0.1.19), and likewise a larger .rodata (non-relocatable read-only
data) (225979 with 0.1.8 vs. 82523 with 0.1.19).
This can be alleviated by replacing slices with indexes into a unique
slice per decomposition table, overall saving 228K while adding only
160 bytes of code. This also makes the overall cost of
unicode-normalization lower than what it was in 0.1.8.
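A minimal sketch of the idea, using toy data and hypothetical names rather than the crate's generated tables. It also uses a plain binary search for the lookup where the crate uses perfect hashes. The point is only that the index entries are plain integers, so they need no relocations:

```rust
/// One flat array holding every decomposition back to back
/// (toy data for illustration, not the generated table).
static DECOMPOSED_CHARS: [char; 5] = ['A', '\u{300}', 'A', '\u{301}', '\u{3a9}'];

/// (code point, offset into DECOMPOSED_CHARS, length), sorted by code point.
/// No pointers are stored here, only integers.
static DECOMPOSED_INDEX: [(u32, u16, u16); 3] = [
    (0x00C0, 0, 2), // U+00C0 -> U+0041 U+0300
    (0x00C1, 2, 2), // U+00C1 -> U+0041 U+0301
    (0x2126, 4, 1), // U+2126 OHM SIGN -> U+03A9
];

/// Look up a decomposition and rebuild the slice from (offset, length).
pub fn decomposition(c: char) -> Option<&'static [char]> {
    let i = DECOMPOSED_INDEX
        .binary_search_by_key(&(c as u32), |&(cp, _, _)| cp)
        .ok()?;
    let (_, offset, len) = DECOMPOSED_INDEX[i];
    let start = offset as usize;
    Some(&DECOMPOSED_CHARS[start..start + len as usize])
}
```

Callers still get a `&'static [char]`, so the change stays internal to the tables.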
As far as performance is concerned, at least on my machine, it makes
virtually no difference on `cargo bench`:
on master:
running 22 tests
test bench_is_nfc_ascii ... bench: 13 ns/iter (+/- 0)
test bench_is_nfc_normalized ... bench: 23 ns/iter (+/- 0)
test bench_is_nfc_not_normalized ... bench: 347 ns/iter (+/- 2)
test bench_is_nfc_stream_safe_ascii ... bench: 13 ns/iter (+/- 0)
test bench_is_nfc_stream_safe_normalized ... bench: 31 ns/iter (+/- 0)
test bench_is_nfc_stream_safe_not_normalized ... bench: 374 ns/iter (+/- 2)
test bench_is_nfd_ascii ... bench: 9 ns/iter (+/- 0)
test bench_is_nfd_normalized ... bench: 29 ns/iter (+/- 2)
test bench_is_nfd_not_normalized ... bench: 9 ns/iter (+/- 0)
test bench_is_nfd_stream_safe_ascii ... bench: 16 ns/iter (+/- 0)
test bench_is_nfd_stream_safe_normalized ... bench: 40 ns/iter (+/- 0)
test bench_is_nfd_stream_safe_not_normalized ... bench: 9 ns/iter (+/- 0)
test bench_nfc_ascii ... bench: 525 ns/iter (+/- 1)
test bench_nfc_long ... bench: 186,528 ns/iter (+/- 1,613)
test bench_nfd_ascii ... bench: 283 ns/iter (+/- 30)
test bench_nfd_long ... bench: 120,183 ns/iter (+/- 4,510)
test bench_nfkc_ascii ... bench: 513 ns/iter (+/- 1)
test bench_nfkc_long ... bench: 192,922 ns/iter (+/- 1,673)
test bench_nfkd_ascii ... bench: 276 ns/iter (+/- 30)
test bench_nfkd_long ... bench: 137,163 ns/iter (+/- 2,159)
test bench_streamsafe_adversarial ... bench: 323 ns/iter (+/- 5)
test bench_streamsafe_ascii ... bench: 25 ns/iter (+/- 0)
with patch applied:
running 22 tests
test bench_is_nfc_ascii ... bench: 13 ns/iter (+/- 0)
test bench_is_nfc_normalized ... bench: 23 ns/iter (+/- 0)
test bench_is_nfc_not_normalized ... bench: 347 ns/iter (+/- 7)
test bench_is_nfc_stream_safe_ascii ... bench: 13 ns/iter (+/- 0)
test bench_is_nfc_stream_safe_normalized ... bench: 36 ns/iter (+/- 1)
test bench_is_nfc_stream_safe_not_normalized ... bench: 377 ns/iter (+/- 14)
test bench_is_nfd_ascii ... bench: 9 ns/iter (+/- 0)
test bench_is_nfd_normalized ... bench: 29 ns/iter (+/- 3)
test bench_is_nfd_not_normalized ... bench: 10 ns/iter (+/- 0)
test bench_is_nfd_stream_safe_ascii ... bench: 16 ns/iter (+/- 0)
test bench_is_nfd_stream_safe_normalized ... bench: 39 ns/iter (+/- 1)
test bench_is_nfd_stream_safe_not_normalized ... bench: 10 ns/iter (+/- 0)
test bench_nfc_ascii ... bench: 545 ns/iter (+/- 2)
test bench_nfc_long ... bench: 186,348 ns/iter (+/- 1,660)
test bench_nfd_ascii ... bench: 281 ns/iter (+/- 2)
test bench_nfd_long ... bench: 124,720 ns/iter (+/- 5,967)
test bench_nfkc_ascii ... bench: 517 ns/iter (+/- 4)
test bench_nfkc_long ... bench: 194,943 ns/iter (+/- 1,636)
test bench_nfkd_ascii ... bench: 274 ns/iter (+/- 0)
test bench_nfkd_long ... bench: 127,973 ns/iter (+/- 1,161)
test bench_streamsafe_adversarial ... bench: 320 ns/iter (+/- 3)
test bench_streamsafe_ascii ... bench: 25 ns/iter (+/- 0)
Hangul Syllables and several other ranges are defined in UnicodeData.txt
as just their first and last values:
```
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
```
Teach the unicode.py script how to recognize these, so that it correctly
classifies them as assigned ranges, for the `is_public_assigned`
predicate.
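The pairing logic can be sketched as follows. This is written in Rust for illustration only (the real change is in unicode.py), the function name is hypothetical, and it ignores the General_Category field, so it does not filter out Private-Use or surrogate ranges:

```rust
/// Parse UnicodeData.txt-style lines into inclusive code point ranges,
/// pairing each `<..., First>` line with the `<..., Last>` line after it.
pub fn assigned_ranges(data: &str) -> Vec<(u32, u32)> {
    let mut ranges = Vec::new();
    let mut pending_first: Option<u32> = None;
    for line in data.lines() {
        let mut fields = line.split(';');
        // Field 0 is the code point in hex; skip lines that do not parse.
        let cp = match fields
            .next()
            .and_then(|f| u32::from_str_radix(f.trim(), 16).ok())
        {
            Some(cp) => cp,
            None => continue,
        };
        // Field 1 is the character name (or the range marker).
        let name = fields.next().unwrap_or("");
        if name.ends_with(", First>") {
            pending_first = Some(cp);
        } else if name.ends_with(", Last>") {
            if let Some(first) = pending_first.take() {
                ranges.push((first, cp));
            }
        } else {
            ranges.push((cp, cp));
        }
    }
    ranges
}
```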
Add an `is_public_assigned` predicate, which tests whether a given
`char` is assigned (`General_Category` != `Unassigned`) in the currently
supported version of Unicode, and not Private-Use (`General_Category`
!= `Private_Use`).
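For illustration, such a predicate can be a binary search over a sorted table of inclusive ranges. The table below is a tiny hand-picked sample and the names are hypothetical; the real table is generated from UnicodeData.txt and covers every assigned range:

```rust
use std::cmp::Ordering;

/// Tiny sample of assigned, non-Private-Use ranges (inclusive, sorted).
/// Illustrative only: anything outside this sample is reported as `false`.
const SAMPLE_ASSIGNED: &[(u32, u32)] = &[
    (0x0000, 0x0377), // Basic Latin up to the start of Greek
    (0xAC00, 0xD7A3), // Hangul Syllables
];

/// Returns true if `c` falls inside one of the sample ranges.
pub fn in_sample_assigned(c: char) -> bool {
    let cp = c as u32;
    SAMPLE_ASSIGNED
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}
```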
This comes up in some use cases sensitive to the stability of NFC over
Unicode version changes. An unassigned codepoint could become assigned in
the future, and new normalizations could apply to it.
For further details, see
- <https://unicode.org/reports/tr15/#Versioning>
NFC compositions can involve multiple starters, such as `\u{11347}` and
`\u{11357}`. Adjust the counting iterator in the streaming fuzzer to
only count non-starters, so that it doesn't over-count.
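A minimal sketch of counting that way, assuming a toy combining-class lookup that only knows the characters used here (the fuzzer itself would use the crate's full table; both function names are hypothetical):

```rust
/// Toy Canonical_Combining_Class lookup covering only a few characters.
/// Everything not listed is treated as a starter (class 0).
pub fn toy_ccc(c: char) -> u8 {
    match c {
        '\u{301}' => 230, // COMBINING ACUTE ACCENT
        '\u{323}' => 220, // COMBINING DOT BELOW
        _ => 0,
    }
}

/// Longest run of consecutive non-starters. Any starter resets the
/// count, including starters that can compose with a previous starter.
pub fn longest_nonstarter_run<I: IntoIterator<Item = char>>(chars: I) -> usize {
    let mut run = 0usize;
    let mut longest = 0usize;
    for c in chars {
        if toy_ccc(c) == 0 {
            run = 0;
        } else {
            run += 1;
            longest = longest.max(run);
        }
    }
    longest
}
```

With this, `\u{11347}\u{11357}` contributes nothing to the count, since both are starters.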
Fixes #76.
Switch to a dedicated `svar()` iterator function, which just does
standardized variation sequences, rather than framing this functionality
as an open-ended "extended" version of the standard normalization
algorithms. This makes for a more factored API, gives users more control
over exactly what transformations are done, and has less impact on users
that don't need this new functionality.
The standard normalization algorithm decomposes CJK compatibility ideographs
into codepoints that are nominally equivalent but traditionally look different.
This is one of the main reasons normalization is considered destructive in
practice.
[Unicode 6.3] introduced a solution for this, by providing
[standardized variation sequences] for these codepoints. For example, while
U+2F8A6 "CJK COMPATIBILITY IDEOGRAPH-2F8A6" canonically decomposes to U+6148
with a different appearance, in Unicode 6.3 and later the standardized variation
sequences in the StandardizedVariants.txt file include the following:
> 6148 FE00; CJK COMPATIBILITY IDEOGRAPH-2F8A6;
which says that "CJK COMPATIBILITY IDEOGRAPH-2F8A6" corresponds to
U+6148 U+FE00, where U+FE00 is "VARIATION SELECTOR-1".
U+6148 and U+FE00 are both normalized codepoints, so we can transform text
containing U+2F8A6 into normal form without losing information about the
distinct appearance. At this time, many popular implementations ignore these
variation selectors; however, this technique at least preserves the information
in a standardized way, so implementations could use it if they chose.
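To make the transformation concrete, here is a toy sketch with a one-entry table holding only the mapping quoted above. The function names are hypothetical and this is not the crate's API, which works from tables generated from StandardizedVariants.txt:

```rust
/// Toy lookup containing only the single StandardizedVariants.txt
/// entry quoted above.
pub fn toy_variation_sequence(c: char) -> Option<[char; 2]> {
    match c {
        // CJK COMPATIBILITY IDEOGRAPH-2F8A6 -> U+6148 VARIATION SELECTOR-1
        '\u{2F8A6}' => Some(['\u{6148}', '\u{FE00}']),
        _ => None,
    }
}

/// Replace each character that has an entry with its variation
/// sequence and pass everything else through unchanged.
pub fn toy_svar(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match toy_variation_sequence(c) {
            Some(seq) => out.extend(seq.iter()),
            None => out.push(c),
        }
    }
    out
}
```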
This PR adds "ext" versions of the `nfd`, `nfc`, `nfkd`, and `nfkc`
iterators, which perform the standard algorithms extended with this technique.
They don't match the standard decompositions, and don't guarantee stability,
but they do produce appropriately normalized output.
I used the generic term "ext" to reflect that other extensions could
theoretically be added in the future. The standard decomposition tables are
limited by their stability requirements, but these "ext" versions could be
free to adopt new useful rules.
I'm not an expert in any of these topics, so please correct me if I'm mistaken
in any of this. Also, I'm open to ideas about how to best present this
functionality in the API.
[Unicode 6.3]: https://www.unicode.org/versions/Unicode6.3.0/#Summary
[standardized variation sequences]: http://unicode.org/faq/vs.html
Once the decompose iterator sees a starter, it should immediately start
returning characters from the preceding sequence. If the input happens
to be stream-safe, it should never get more than MAX_NONSTARTERS plus
boundary values ahead of its inner iterator.