mirror of
https://gitee.com/openharmony/third_party_rust_unicode-ident
synced 2024-11-27 01:31:05 +00:00
Writeup of ucd-trie crate
parent
fdca2f0b7a
commit
56463840c6
45
README.md
@ -92,6 +92,51 @@ char and high char. I don't expect that performance would improve much but this
could be the most efficient for space across all the libraries, needing only
about 7 K to store.

#### ucd-trie

Their data structure is a compressed trie set specifically tailored for Unicode
codepoints. The design is credited to Raph Levien in [rust-lang/rust#33098].

[rust-lang/rust#33098]: https://github.com/rust-lang/rust/pull/33098

```rust
pub struct TrieSet {
    // Codepoints [0, 0x800): 32 chunks of 64 bits each.
    tree1_level1: &'static [u64; 32],
    // Codepoints [0x800, 0x10000): 992 leaf indices, then the chunks.
    tree2_level1: &'static [u8; 992],
    tree2_level2: &'static [u64],
    // Codepoints [0x10000, 0x110000): two index levels, then the chunks.
    tree3_level1: &'static [u8; 256],
    tree3_level2: &'static [u8],
    tree3_level3: &'static [u64],
}
```

It represents codepoint sets using a trie to achieve prefix compression. The
final states of the trie are embedded in leaves or "chunks", where each chunk is
a 64-bit integer. Each bit position of the integer corresponds to whether a
particular codepoint is in the set or not. These chunks are not just a compact
representation of the final states of the trie, but are also a form of suffix
compression. In particular, if multiple ranges of 64 contiguous codepoints have
the same Unicode properties, then they all map to the same chunk in the final
level of the trie.

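The chunk idea above can be sketched in a few lines. This is only an
illustration, not code from `ucd-trie`: the names `build_chunks` and
`chunk_contains` are made up here, and the sketch assumes fewer than 256
distinct chunks so that a leaf index fits in a `u8`. Identical 64-codepoint
ranges end up pointing at the same leaf, which is the suffix compression
described above.

```rust
/// Hypothetical helper: packs `members` into one 64-bit chunk per range of
/// 64 contiguous codepoints, then deduplicates identical chunks.
/// Returns (leaf index per range, unique chunks).
/// Assumes every member is below `64 * n_ranges` and that there are fewer
/// than 256 distinct chunks.
fn build_chunks(members: &[u32], n_ranges: usize) -> (Vec<u8>, Vec<u64>) {
    // One bitmap per range; bit `cp & 0x3f` of chunk `cp >> 6` marks `cp`.
    let mut raw = vec![0u64; n_ranges];
    for &cp in members {
        raw[(cp >> 6) as usize] |= 1u64 << (cp & 0x3f);
    }

    // Ranges with identical bitmaps share a single leaf.
    let mut leaves: Vec<u64> = Vec::new();
    let mut index: Vec<u8> = Vec::with_capacity(n_ranges);
    for chunk in raw {
        let pos = match leaves.iter().position(|&c| c == chunk) {
            Some(pos) => pos,
            None => {
                leaves.push(chunk);
                leaves.len() - 1
            }
        };
        index.push(pos as u8);
    }
    (index, leaves)
}

/// Hypothetical helper: membership test against the two arrays above.
fn chunk_contains(index: &[u8], leaves: &[u64], cp: u32) -> bool {
    match index.get((cp >> 6) as usize) {
        Some(&leaf) => ((leaves[leaf as usize] >> (cp & 0x3f)) & 1) == 1,
        None => false,
    }
}
```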
Being tailored for Unicode codepoints, this trie is partitioned into three
disjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints \[0,
0x800), the second \[0x800, 0x10000) and the third \[0x10000, 0x110000). These
partitions conveniently correspond to the space of 1 or 2 byte UTF-8 encoded
codepoints, 3 byte UTF-8 encoded codepoints and 4 byte UTF-8 encoded codepoints,
respectively.

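The correspondence between the partitions and UTF-8 lengths can be checked
directly with the standard library's `char::len_utf8`. The `Partition` enum
and both functions below are illustrative names, not part of `ucd-trie`.

```rust
/// Illustrative label for the three partitions described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Partition {
    Tree1, // [0, 0x800)
    Tree2, // [0x800, 0x10000)
    Tree3, // [0x10000, 0x110000)
}

/// Picks the partition from the numeric codepoint range.
fn partition_by_range(ch: char) -> Partition {
    match ch as u32 {
        0..=0x7FF => Partition::Tree1,
        0x800..=0xFFFF => Partition::Tree2,
        _ => Partition::Tree3,
    }
}

/// Picks the partition from the length of the UTF-8 encoding.
fn partition_by_utf8_len(ch: char) -> Partition {
    match ch.len_utf8() {
        1 | 2 => Partition::Tree1,
        3 => Partition::Tree2,
        _ => Partition::Tree3,
    }
}
```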
Lookups in this data structure are significantly more efficient than binary
search. A lookup touches either 1, 2, or 3 cache lines based on which of the
trie partitions is being accessed.

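A rough sketch of such a lookup follows, using an owned stand-in for the
struct shown earlier. `SketchTrie`, `bit`, `intern` and `build` are
hypothetical names, and the index arithmetic is inferred from the array sizes
in the struct rather than copied from the crate, so treat it as an assumption
about the layout. The three branches perform one, two and three array reads
respectively. `build` assumes the set has fewer than 256 distinct chunks and
blocks per level.

```rust
/// Owned stand-in for the `TrieSet` fields (hypothetical, for illustration).
pub struct SketchTrie {
    pub tree1_level1: Vec<u64>, // 32 chunks
    pub tree2_level1: Vec<u8>,  // 992 leaf indices
    pub tree2_level2: Vec<u64>,
    pub tree3_level1: Vec<u8>,  // 256 child indices
    pub tree3_level2: Vec<u8>,  // 64 leaf indices per child
    pub tree3_level3: Vec<u64>,
}

/// Tests the bit for `cp` inside its 64-codepoint chunk.
fn bit(chunk: u64, cp: u32) -> bool {
    ((chunk >> (cp & 0x3f)) & 1) == 1
}

impl SketchTrie {
    pub fn contains(&self, cp: u32) -> bool {
        let i = cp as usize;
        if cp < 0x800 {
            // One read: the chunk itself.
            bit(self.tree1_level1[i >> 6], cp)
        } else if cp < 0x10000 {
            // Two reads: leaf index, then chunk.
            let leaf = self.tree2_level1[(i >> 6) - 0x20] as usize;
            bit(self.tree2_level2[leaf], cp)
        } else if cp < 0x110000 {
            // Three reads: child block, leaf index, then chunk.
            let child = self.tree3_level1[(i >> 12) - 0x10] as usize;
            let leaf = self.tree3_level2[child * 64 + ((i >> 6) & 0x3f)] as usize;
            bit(self.tree3_level3[leaf], cp)
        } else {
            false
        }
    }
}

/// Returns the index of `item` in `pool`, appending it if absent.
/// Assumes fewer than 256 distinct items, so the index fits in a `u8`.
fn intern<T: PartialEq>(pool: &mut Vec<T>, item: T) -> u8 {
    let pos = match pool.iter().position(|x| *x == item) {
        Some(pos) => pos,
        None => {
            pool.push(item);
            pool.len() - 1
        }
    };
    pos as u8
}

/// Builds the sketch from a membership predicate over codepoints.
pub fn build(is_member: impl Fn(u32) -> bool) -> SketchTrie {
    // Packs 64 consecutive codepoints starting at `base` into one chunk.
    let chunk = |base: u32| -> u64 {
        (0..64u32).fold(0u64, |acc, b| acc | ((is_member(base + b) as u64) << b))
    };
    let tree1_level1: Vec<u64> = (0..32u32).map(|i| chunk(i * 64)).collect();

    let mut tree2_level2: Vec<u64> = Vec::new();
    let tree2_level1: Vec<u8> = (0..992u32)
        .map(|i| intern(&mut tree2_level2, chunk(0x800 + i * 64)))
        .collect();

    // Each block covers 4096 codepoints as 64 leaf indices; identical
    // blocks are shared, just like identical chunks.
    let mut blocks: Vec<Vec<u8>> = Vec::new();
    let mut tree3_level3: Vec<u64> = Vec::new();
    let tree3_level1: Vec<u8> = (0..256u32)
        .map(|c| {
            let block: Vec<u8> = (0..64u32)
                .map(|j| intern(&mut tree3_level3, chunk(0x10000 + c * 4096 + j * 64)))
                .collect();
            intern(&mut blocks, block)
        })
        .collect();

    SketchTrie {
        tree1_level1,
        tree2_level1,
        tree2_level2,
        tree3_level1,
        tree3_level2: blocks.concat(),
        tree3_level3,
    }
}
```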
One possible performance improvement would be for this crate to expose a way to
query based on a UTF-8 encoded string, returning the Unicode property
corresponding to the first character in the string. Without such an API, the
caller is required to tokenize their UTF-8 encoded input data into `char`, hand
the `char` into `ucd-trie`, only for `ucd-trie` to undo that work by converting
back into the variable-length representation for trie traversal.

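To illustrate why a string-based query could skip the `char` round trip, here
is a sketch of how the trie coordinates fall out of the UTF-8 bytes. The
function `chunk_coords` is hypothetical and does not exist in `ucd-trie`. It
returns the partition number, the chunk number (`cp >> 6`) and the bit
position (`cp & 0x3f`) for the first character of a string, relying on the
fact that each UTF-8 continuation byte carries exactly six bits.

```rust
/// Hypothetical helper, not part of `ucd-trie`: for the first character of
/// `s`, returns (partition 1..=3, chunk number `cp >> 6`, bit `cp & 0x3f`)
/// computed from the UTF-8 bytes. Returns `None` for an empty string.
fn chunk_coords(s: &str) -> Option<(u8, u32, u32)> {
    let bytes = s.as_bytes();
    let lead = *bytes.first()? as u32;
    // `s` is valid UTF-8, so the continuation bytes implied by the leading
    // byte are guaranteed to be present. Each contributes its low six bits.
    let cont = |i: usize| (bytes[i] & 0x3f) as u32;
    Some(match lead {
        // 1 byte: the codepoint is the byte itself.
        0x00..=0x7f => (1, lead >> 6, lead & 0x3f),
        // 2 bytes: the leading byte holds the chunk number.
        0x80..=0xdf => (1, lead & 0x1f, cont(1)),
        // 3 bytes: the last byte is the bit position.
        0xe0..=0xef => (2, ((lead & 0x0f) << 6) | cont(1), cont(2)),
        // 4 bytes: same idea with one more six-bit digit.
        _ => (
            3,
            ((lead & 0x07) << 12) | (cont(1) << 6) | cont(2),
            cont(3),
        ),
    })
}
```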
<br>

## License