openharmony/third_party_rust_unicode-normalization

Mirror of https://github.com/openharmony/third_party_rust_unicode-normalization.git, synced 2026-07-01 21:33:59 -04:00

Author	SHA1	Message	Date
peizhe	a00d190d9a	Add GN Build Files and Custom Modifications Signed-off-by: peizhe <472708703@qq.com>	2023-04-18 18:33:06 +08:00
Manish Goregaokar	e2273e9741	Merge pull request #89 from crlf0710/master Update to Unicode 15 and bump version to 0.1.22 github.com/unicode-rs/unicode-normalization/refs/tags/v0.1.22	2022-09-16 08:36:47 -07:00
Charles Lew	3bff41ebcb	Update to Unicode 15 and bump version to 0.1.22	2022-09-16 23:22:08 +08:00
Manish Goregaokar	a077fd79f2	Release 0.1.21	2022-07-01 09:04:24 -07:00
Manish Goregaokar	f9c5485af0	Merge pull request #88 from theodore-s-beers/unicode-14 Update to Unicode 14	2022-07-01 09:03:56 -07:00
Theo Beers	240929c088	Update to Unicode 14	2022-07-01 14:41:03 +02:00
Manish Goregaokar	d53895e530	Merge pull request #87 from glandium/0.1.20 Bump version to 0.1.20	2022-06-24 07:38:20 -07:00
Mike Hommey	df233b44f7	Bump version to 0.1.20	2022-06-24 07:53:28 +09:00
Manish Goregaokar	b2fdf0bade	Merge pull request #86 from glandium/tables Avoid slices in entries of decomposition tables	2022-06-23 15:20:43 -07:00
Mike Hommey	0923b90948	Avoid slices in entries of decomposition tables For small arrays of data, slices are expensive: - there's the obvious size of the array (`sizeof::<char>() * length`) - there's the size of the slice itself (`sizeof::<(*const str, usize)>()`) - there's the size of the relocation for the pointer in the slice. The worst case is on 64-bits ELF, where it is `3 * sizeof::<usize>()`(!). Most entries in decomposition tables are 2 characters or less, so the overhead for each of these tables is incredibly large. To give an idea, a print "Hello, World" (fresh from cargo new) executable built with `--release` on my machine has 17712 bytes of .rela.dyn (relocations) and 9520 bytes of .data.rel.ro (relocatable read-only data). Adding unicode-normalization as a dependency and changing the code to `println!("{}", String::from_iter("Hello, world!".nfc()));`, those jump to, respectively, 156336 and 147968 bytes. For comparison, with unicode-normalization 0.1.8 (last release before the perfect hashes), they were 18168 and 9872 bytes. This is however compensated by the .text (code) being larger (314607 with 0.1.8 vs. 234639 with 0.1.19); likewise for .rodata (non-relocatable read-only data) (225979 with 0.1.8, vs. 82523 with 0.1.19). This can be alleviated by replacing slices with indexes into a unique slice per decomposition table, overall saving 228K (while barely adding to code size (160 bytes)). This also makes the overall cost of unicode-normalization lower than what it was in 0.1.8. As far as performance is concerned, at least on my machine, it makes virtually no difference on `cargo bench`: on master: running 22 tests test bench_is_nfc_ascii ... bench: 13 ns/iter (+/- 0) test bench_is_nfc_normalized ... bench: 23 ns/iter (+/- 0) test bench_is_nfc_not_normalized ... bench: 347 ns/iter (+/- 2) test bench_is_nfc_stream_safe_ascii ... bench: 13 ns/iter (+/- 0) test bench_is_nfc_stream_safe_normalized ... bench: 31 ns/iter (+/- 0) test bench_is_nfc_stream_safe_not_normalized ... bench: 374 ns/iter (+/- 2) test bench_is_nfd_ascii ... bench: 9 ns/iter (+/- 0) test bench_is_nfd_normalized ... bench: 29 ns/iter (+/- 2) test bench_is_nfd_not_normalized ... bench: 9 ns/iter (+/- 0) test bench_is_nfd_stream_safe_ascii ... bench: 16 ns/iter (+/- 0) test bench_is_nfd_stream_safe_normalized ... bench: 40 ns/iter (+/- 0) test bench_is_nfd_stream_safe_not_normalized ... bench: 9 ns/iter (+/- 0) test bench_nfc_ascii ... bench: 525 ns/iter (+/- 1) test bench_nfc_long ... bench: 186,528 ns/iter (+/- 1,613) test bench_nfd_ascii ... bench: 283 ns/iter (+/- 30) test bench_nfd_long ... bench: 120,183 ns/iter (+/- 4,510) test bench_nfkc_ascii ... bench: 513 ns/iter (+/- 1) test bench_nfkc_long ... bench: 192,922 ns/iter (+/- 1,673) test bench_nfkd_ascii ... bench: 276 ns/iter (+/- 30) test bench_nfkd_long ... bench: 137,163 ns/iter (+/- 2,159) test bench_streamsafe_adversarial ... bench: 323 ns/iter (+/- 5) test bench_streamsafe_ascii ... bench: 25 ns/iter (+/- 0) with patch applied: running 22 tests test bench_is_nfc_ascii ... bench: 13 ns/iter (+/- 0) test bench_is_nfc_normalized ... bench: 23 ns/iter (+/- 0) test bench_is_nfc_not_normalized ... bench: 347 ns/iter (+/- 7) test bench_is_nfc_stream_safe_ascii ... bench: 13 ns/iter (+/- 0) test bench_is_nfc_stream_safe_normalized ... bench: 36 ns/iter (+/- 1) test bench_is_nfc_stream_safe_not_normalized ... bench: 377 ns/iter (+/- 14) test bench_is_nfd_ascii ... bench: 9 ns/iter (+/- 0) test bench_is_nfd_normalized ... bench: 29 ns/iter (+/- 3) test bench_is_nfd_not_normalized ... bench: 10 ns/iter (+/- 0) test bench_is_nfd_stream_safe_ascii ... bench: 16 ns/iter (+/- 0) test bench_is_nfd_stream_safe_normalized ... bench: 39 ns/iter (+/- 1) test bench_is_nfd_stream_safe_not_normalized ... bench: 10 ns/iter (+/- 0) test bench_nfc_ascii ... bench: 545 ns/iter (+/- 2) test bench_nfc_long ... bench: 186,348 ns/iter (+/- 1,660) test bench_nfd_ascii ... bench: 281 ns/iter (+/- 2) test bench_nfd_long ... bench: 124,720 ns/iter (+/- 5,967) test bench_nfkc_ascii ... bench: 517 ns/iter (+/- 4) test bench_nfkc_long ... bench: 194,943 ns/iter (+/- 1,636) test bench_nfkd_ascii ... bench: 274 ns/iter (+/- 0) test bench_nfkd_long ... bench: 127,973 ns/iter (+/- 1,161) test bench_streamsafe_adversarial ... bench: 320 ns/iter (+/- 3) test bench_streamsafe_ascii ... bench: 25 ns/iter (+/- 0)	2022-06-23 18:10:18 +09:00
Manish Goregaokar	664130397f	Merge pull request #82 from Xaeroxe/fix-75 Fix #75, implement UnicodeNormalization for char	2021-10-07 20:58:24 -07:00
Jacob Kiesel	e9b2c4499e	Fix #75, implement UnicodeNormalization for char	2021-10-07 14:46:14 -06:00
Manish Goregaokar	3ed26b3d29	Merge pull request #81 from sunfishcode/main Publish 0.1.19	2021-06-02 08:32:30 -07:00
Dan Gohman	a8892812df	Publish 0.1.19	2021-06-01 15:31:35 -07:00
Manish Goregaokar	f7666c1ac3	Merge pull request #80 from sunfishcode/main Fix `is_public_assigned` to include Hangul Syllable and other ranges.	2021-06-01 15:27:57 -07:00
Dan Gohman	33e73008da	Fix `is_public_assigned` to include Hangul Syllable and other ranges. Hangul Syllables and several other ranges are defined in UnicodeData.txt as just their first and last values: ``` AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;; D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;; ``` Teach the unicode.py script how to recognize these, so that it correctly classifies them as assigned ranges, for the `is_public_assigned` predicate.	2021-06-01 14:42:29 -07:00
Manish Goregaokar	74f416f8ea	Merge pull request #79 from sunfishcode/main Publish 0.1.18	2021-05-28 11:44:23 -07:00
Dan Gohman	8f31969fc7	Publish 0.1.18	2021-05-28 10:50:58 -07:00
Manish Goregaokar	67b5460819	Merge pull request #78 from sunfishcode/main Add an `is_public_assigned` predicate.	2021-05-28 09:26:11 -07:00
Dan Gohman	aaa72a31da	Add an `is_public_assigned` predicate. Add an `is_public_assigned` predicate, which tests whether a given `char` is assigned (`General_Category` != `Unassigned`) in the currently supported version of Unicode, and not Private-Use (`General_Category` != `Private_Use`). This comes up in some use cases sensitive to the stability of NFC over Unicode version changes. An unassigned codepoint could become assigned in the future, and new normalizations could apply to it. For further details, see - <https://unicode.org/reports/tr15/#Versioning>	2021-05-28 09:13:34 -07:00
Manish Goregaokar	58f3f963f1	Merge pull request #77 from sunfishcode/main Handle multiple starters in the stream-safe fuzzer.	2021-05-18 09:00:14 -07:00
Dan Gohman	479c6f0da9	Handle multiple starters in the stream-safe fuzzer. NFC compositions can involve multiple starters, such as `\u{11347}` and `\u{11357}`. Adjust the counting iterator in the streaming fuzzer to only count non-starters, so that it doesn't over-count. Fixes #76.	2021-05-17 21:20:24 -07:00
Manish Goregaokar	06ad2d82de	Fix cd to package	2021-02-08 17:06:21 -08:00
Manish Goregaokar	01884c9261	Merge pull request #72 from sunfishcode/main Publish 0.1.17	2021-02-08 17:01:09 -08:00
Dan Gohman	209762718c	Publish 0.1.17	2021-02-08 10:02:55 -08:00
Sujay Jayakar	b388aa7e07	Merge pull request #70 from sunfishcode/ext Add new normalization algorithms using Standardized Variants	2021-01-06 10:05:16 -08:00
Dan Gohman	ad609637a3	Fix the CJK Compat Variants decomp stats string. Also, remove the non-`fully` `cjk_compat_variants_decomp` map, since it's no longer used.	2021-01-06 10:01:25 -08:00
Dan Gohman	dba17f95a2	Use `ArrayVec` to panic instead of resizing on overflow.	2021-01-06 09:54:32 -08:00
Dan Gohman	0083e1014e	Rename `svar` to `cjk_compat_variants`.	2021-01-05 22:51:46 -08:00
Dan Gohman	052e6d7367	Avoid saying "non-standard" in a comment. The standardized variations sequences are standardized, so don't imply otherwise.	2021-01-04 07:35:50 -08:00
Dan Gohman	6b376dbea8	Don't decompose Hangul in the `svar` iterator.	2020-12-07 11:56:16 -08:00
Dan Gohman	f362213cfb	Switch to a more explicit API. Switch to a dedicated `svar()` iterator function, which just does standardized variation sequences, rather than framing this functionality as an open-ended "extended" version of the standard normalization algorithms. This makes for a more factored API, gives users more control over exactly what transformations are done, and has less impact on users that don't need this new functionality.	2020-12-06 08:26:50 -08:00
Dan Gohman	107879735b	Add new normalization algorithms using Standardized Variants The standard normalization algorithm decomposes CJK compatibility ideographs into nominally equivalent codepoints, but which traditionally look different, and is one of the main reasons normalization is considered destructive in practice. [Unicode 6.3] introduced a solution for this, by providing [standardized variation sequences] for these codepoints. For example, while U+2F8A6 "CJK COMPATIBILITY IDEOGRAPH-2F8A6" canonically decomposes to U+6148 with a different appearance, in Unicode 6.3 and later the standardized variation sequences in the StandardizedVariants.txt file include the following: > 6148 FE00; CJK COMPATIBILITY IDEOGRAPH-2F8A6; which says that "CJK COMPATIBILITY IDEOGRAPH-2F8A6" corresponds to U+6148 U+FE00, where U+FE00 is "VARIATION SELECTOR-1". U+6148 and U+FE00 are both normalized codepoints, so we can transform text containing U+2F8A6 into normal form without losing information about the distinct appearance. At this time, many popular implementations ignore these variation selectors; however, this technique at least preserves the information in a standardized way, so implementations could use it if they chose. This PR adds "ext" versions of the `nfd`, `nfc`, `nfkd`, and `nfkc` iterators, which perform the standard algorithms extended with this technique. They don't match the standard decompositions, and don't guarantee stability, but they do produce appropriately normalized output. I used the generic term "ext" to reflect that other extensions could theoretically be added in the future. The standard decomposition tables are limited by their stability requirements, but these "ext" versions could be free to adopt new useful rules. I'm not an expert in any of these topics, so please correct me if I'm mistaken in any of this. Also, I'm open to ideas about how to best present this functionality in the API. [Unicode 6.3]: https://www.unicode.org/versions/Unicode6.3.0/#Summary [standardized variation sequences]: http://unicode.org/faq/vs.html	2020-12-06 08:18:20 -08:00
Sujay Jayakar	8dfab5ee50	Remove dependency on `format!` in `test_all_nonstarters`	2020-11-30 15:21:35 -08:00
Sujay Jayakar	e9210364d8	Update nonstarter_count correctly + add test for all nonstarters string	2020-11-30 15:21:35 -08:00
Manish Goregaokar	fd4997b126	Add github actions	2020-11-30 15:21:35 -08:00
Manish Goregaokar	69a16a17f6	Publish 0.1.16	2020-11-18 13:33:10 -08:00
Manish Goregaokar	933ee7948f	Merge pull request #64 from sunfishcode/sunfishcode/nfd-buffering Make the decompose iterator avoid buffering elements past a starter.	2020-11-18 13:32:05 -08:00
Manish Goregaokar	6138a5beb1	Merge pull request #65 from sunfishcode/sunfishcode/stream-safe-reset Reset the stream-safe position when a starter is seen.	2020-11-18 13:31:32 -08:00
Manish Goregaokar	8d93152948	Merge pull request #66 from sunfishcode/master Minor cleanups after #63	2020-11-18 13:30:30 -08:00
Dan Gohman	49654fcb12	Make the decompose iterator avoid buffering elements past a starter. Once the decompose iterator sees a starter, it should immediately start returning characters from the preceding sequence. If the input happens to be stream-safe, it should never get more than MAX_NONSTARTERS plus boundary values ahead of its inner iterator.	2020-11-18 11:16:13 -08:00
Dan Gohman	43948859e4	Reset the stream-safe position when a starter is seen. This avoids inserting too many CGJs.	2020-11-18 11:15:42 -08:00
Dan Gohman	06dc0cc429	Fix an unused import warning in a test.	2020-11-18 10:48:39 -08:00
Dan Gohman	960ac7fd7d	Use a local path dependency for `unicode-normalization`. And update the libfuzzer-sys dependency while here.	2020-11-18 10:30:11 -08:00
Dan Gohman	00e2834d24	Use `assert_ne!` instead of `assert!` and `!=`.	2020-11-18 10:29:32 -08:00
Manish Goregaokar	b3b473748c	Merge pull request #63 from sunfishcode/master Add a fuzz target.	2020-11-17 22:06:24 -08:00
Dan Gohman	8a24c56ae7	Add a fuzz target.	2020-11-17 21:58:20 -08:00
Manish Goregaokar	96e66358f7	Bump version, add self as co-maintainer	2020-11-17 19:45:32 -08:00
Manish Goregaokar	b8bc682b9b	Merge pull request #62 from unicode-rs/streamsafe-reset Correctly reset streamsafe iterator	2020-11-17 19:43:35 -08:00
Manish Goregaokar	ff51b1fae1	Add test for streamsafe iterator	2020-11-17 19:39:42 -08:00
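
Commit 0923b90948 above describes replacing per-entry slices with indexes into one shared slice per decomposition table. The following is a minimal Rust sketch of that idea, not the crate's generated tables: the names `DECOMPOSED_SLICES`, `DECOMPOSED_CHARS`, `DECOMPOSED_INDEX`, `lookup_slices` and `lookup_indexed` are hypothetical, the three entries are hand-picked, and a linear search stands in for the crate's perfect-hash lookup.

```rust
use std::mem::size_of;

// Layout A (hypothetical): every entry owns a slice. Each `&[char]` is a fat
// pointer (address + length), and each address may also need a relocation
// entry in a position-independent binary.
static DECOMPOSED_SLICES: &[(char, &[char])] = &[
    ('\u{00C0}', &['A', '\u{0300}']),             // A with grave
    ('\u{00C5}', &['A', '\u{030A}']),             // A with ring above
    ('\u{1E08}', &['C', '\u{0327}', '\u{0301}']), // C with cedilla and acute
];

// Layout B (hypothetical): one shared array holds all decomposed characters...
static DECOMPOSED_CHARS: &[char] = &[
    'A', '\u{0300}',
    'A', '\u{030A}',
    'C', '\u{0327}', '\u{0301}',
];

// ...and each entry stores only a (start, length) pair into that array,
// so the table itself contains no pointers.
static DECOMPOSED_INDEX: &[(char, (u16, u16))] = &[
    ('\u{00C0}', (0, 2)),
    ('\u{00C5}', (2, 2)),
    ('\u{1E08}', (4, 3)),
];

/// Look up a decomposition in the slice-per-entry layout.
fn lookup_slices(c: char) -> Option<&'static [char]> {
    DECOMPOSED_SLICES
        .iter()
        .find(|(key, _)| *key == c)
        .map(|(_, chars)| *chars)
}

/// Look up a decomposition in the indexed layout.
fn lookup_indexed(c: char) -> Option<&'static [char]> {
    DECOMPOSED_INDEX
        .iter()
        .find(|(key, _)| *key == c)
        .map(|&(_, (start, len))| {
            let start = start as usize;
            &DECOMPOSED_CHARS[start..start + len as usize]
        })
}

/// Per-entry sizes of the two layouts, in bytes: (slice entry, indexed entry).
fn entry_sizes() -> (usize, usize) {
    (
        size_of::<(char, &[char])>(),
        size_of::<(char, (u16, u16))>(),
    )
}
```

Both lookups return the same characters for the same key. The indexed entry is smaller than the slice entry, and because it holds no pointer it should need no relocation, which is the saving the commit message reports.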


183 Commits
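
Commit 33e73008da above notes that UnicodeData.txt gives some blocks, such as Hangul Syllables, only as a `First`/`Last` pair. The sketch below shows one way to expand such pairs into ranges. It is written in Rust for illustration and is not the logic of the repository's `unicode.py` script; `SAMPLE`, `assigned_ranges` and `is_in_ranges` are hypothetical names, and skipping `Co` (Private_Use) entries is an assumption that follows the description of `is_public_assigned` in commit aaa72a31da.

```rust
// A few lines in UnicodeData.txt format (hypothetical excerpt for the sketch).
const SAMPLE: &str = "\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
";

/// Collect inclusive (first, last) code point ranges from UnicodeData.txt-style
/// text, treating `<..., First>` / `<..., Last>` pairs as one range.
fn assigned_ranges(data: &str) -> Vec<(u32, u32)> {
    let mut ranges = Vec::new();
    let mut pending_first: Option<u32> = None;
    for line in data.lines() {
        // Fields are: code point; name; General_Category; ...
        let mut fields = line.split(';');
        let (cp, name, category) = match (fields.next(), fields.next(), fields.next()) {
            (Some(cp), Some(name), Some(category)) => (cp, name, category),
            _ => continue, // blank or malformed line
        };
        let cp = match u32::from_str_radix(cp.trim(), 16) {
            Ok(value) => value,
            Err(_) => continue,
        };
        if category == "Co" {
            continue; // assumption: private-use code points are excluded
        }
        if name.ends_with(", First>") {
            pending_first = Some(cp);
        } else if name.ends_with(", Last>") {
            if let Some(first) = pending_first.take() {
                ranges.push((first, cp));
            }
        } else {
            ranges.push((cp, cp));
        }
    }
    ranges
}

/// Check whether `c` falls inside any of the collected ranges.
fn is_in_ranges(ranges: &[(u32, u32)], c: char) -> bool {
    let cp = c as u32;
    ranges.iter().any(|&(first, last)| first <= cp && cp <= last)
}
```

With `SAMPLE` as input, a Hangul syllable between the two marker lines, such as U+B000, falls inside a collected range even though it has no line of its own, while U+E000 does not.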