
Unicode ident

Implementation of Unicode Standard Annex #31 for determining which char values are valid in programming language identifiers.

This crate is a better optimized implementation of the older unicode-xid crate. It uses less static storage and classifies both ASCII and non-ASCII codepoints with better performance, 2–10× faster than unicode-xid.
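As a rough illustration of how per-char classification is typically used, the sketch below validates a whole identifier from a "start" predicate and a "continue" predicate. The helper `is_ident` and the ASCII-only stand-ins `ascii_start` and `ascii_continue` are hypothetical names introduced here so the example compiles without dependencies; with this crate you would presumably pass `unicode_ident::is_xid_start` and `unicode_ident::is_xid_continue` instead.

```rust
/// Returns true if `s` is a non-empty identifier: its first char satisfies
/// `is_start` and every following char satisfies `is_continue`.
/// (Hypothetical helper, not part of any crate.)
pub fn is_ident(
    s: &str,
    is_start: impl Fn(char) -> bool,
    is_continue: impl Fn(char) -> bool,
) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) => is_start(first) && chars.all(is_continue),
        None => false,
    }
}

/// ASCII-only stand-in for an XID_Start check (assumption for this sketch).
pub fn ascii_start(ch: char) -> bool {
    ch.is_ascii_alphabetic()
}

/// ASCII-only stand-in for an XID_Continue check (assumption for this sketch).
pub fn ascii_continue(ch: char) -> bool {
    ch.is_ascii_alphanumeric() || ch == '_'
}
```

For example, `is_ident("foo_1", ascii_start, ascii_continue)` accepts the input, while `"1foo"` and the empty string are rejected.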


Comparison of performance

The following table shows a comparison between five Unicode identifier implementations.

  • unicode-ident is this crate;
  • unicode-xid is a widely used crate run by the "unicode-rs" org;
  • ucd-trie and fst are two data structures supported by the ucd-generate tool;
  • roaring is a Rust implementation of Roaring bitmap.

The static storage column shows the total size of static tables that the crate bakes into your binary, measured in 1000s of bytes.

The remaining columns show the cost per call to evaluate whether a single char has the XID_Start or XID_Continue Unicode property, comparing across different ratios of ASCII to non-ASCII codepoints in the input data.

| | static storage | 0% nonascii | 1% | 10% | 100% nonascii |
|---|---|---|---|---|---|
| unicode-ident | 9.75 K | 0.96 ns | 0.95 ns | 1.09 ns | 1.55 ns |
| unicode-xid | 11.34 K | 1.88 ns | 2.14 ns | 3.48 ns | 15.63 ns |
| ucd-trie | 9.95 K | 1.29 ns | 1.28 ns | 1.36 ns | 2.15 ns |
| fst | 133 K | 55.1 ns | 54.9 ns | 53.2 ns | 28.5 ns |
| roaring | 66.1 K | 2.78 ns | 3.09 ns | 3.37 ns | 4.70 ns |
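
The speedup relative to unicode-xid can be read off the table by dividing the two rows column by column. The sketch below does that arithmetic; the constant and function names are hypothetical, and the numbers are copied from the table rather than measured.

```rust
/// Per-call timings in nanoseconds, copied from the table, in column order:
/// 0%, 1%, 10%, 100% non-ASCII input.
pub const UNICODE_IDENT_NS: [f64; 4] = [0.96, 0.95, 1.09, 1.55];
pub const UNICODE_XID_NS: [f64; 4] = [1.88, 2.14, 3.48, 15.63];

/// Ratio of unicode-xid time to unicode-ident time for each column.
pub fn speedups() -> [f64; 4] {
    let mut out = [0.0; 4];
    for i in 0..4 {
        out[i] = UNICODE_XID_NS[i] / UNICODE_IDENT_NS[i];
    }
    out
}
```

The ratios run from roughly 2× on pure ASCII input to roughly 10× on pure non-ASCII input.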

Source code for the benchmark is provided in the benches directory of this repo; the results can be reproduced by running cargo criterion.


License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.