mirror of
https://github.com/mozilla/gecko-dev.git
synced 2024-12-03 18:47:53 +00:00
utf8-ranges
This crate converts contiguous ranges of Unicode scalar values to UTF-8 byte ranges. This is useful when constructing byte-based automata from Unicode. Stated differently, it lets one embed UTF-8 decoding as part of one's automaton.
Dual-licensed under MIT or the UNLICENSE.
Documentation
Example
This shows how to convert a scalar value range (e.g., the basic multilingual plane) to a sequence of byte-based character classes.

```rust
extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    for range in Utf8Sequences::new('\u{0}', '\u{FFFF}') {
        println!("{:?}", range);
    }
}
```
The output:
```
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
```
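To see where the range boundaries in this output come from, it can help to encode the scalar values at each boundary by hand. The following is a minimal sketch using only the standard library; `utf8_bytes` is a hypothetical helper written for this illustration and is not part of `utf8-ranges`.

```rust
/// Hypothetical helper (not part of utf8-ranges): encodes one scalar value
/// as UTF-8 and returns the resulting bytes.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // Scalar values on either side of each encoding-length or surrogate boundary
    // within U+0000..=U+FFFF.
    let boundaries = [
        '\u{0}', '\u{7F}',      // one-byte encodings
        '\u{80}', '\u{7FF}',    // two-byte encodings
        '\u{800}', '\u{D7FF}',  // three-byte encodings below the surrogate gap
        '\u{E000}', '\u{FFFF}', // three-byte encodings above the surrogate gap
    ];
    for c in boundaries {
        // Print the code point and its UTF-8 bytes in hexadecimal.
        println!("U+{:04X} -> {:02X?}", c as u32, utf8_bytes(c));
    }
}
```

U+D7FF encodes as `ED 9F BF` and U+E000 as `EE 80 80`, which is why the `[ED]` sequence stops its second byte at `9F`: the bytes `ED A0 80` through `ED BF BF` would encode surrogate code points, which are not scalar values.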
These ranges can then be used to build an automaton. Namely:
- Every arbitrary sequence of bytes matches exactly one of the sequences of ranges or none of them.
- Every matching sequence of bytes is guaranteed to be valid UTF-8. (Erroneous encodings of surrogate codepoints in UTF-8 cannot match any of the byte ranges above.)
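
The two properties above can be checked with a small sketch. It assumes the six sequences from the output are transcribed by hand into a table, so it does not depend on the crate itself; `ByteRange`, `BMP_SEQUENCES`, and `matching_sequence` are hypothetical names introduced for this illustration. A real automaton would share states between sequences rather than try each one in turn.

```rust
/// An inclusive byte range, e.g. (0x80, 0xBF). Hypothetical type for this sketch.
type ByteRange = (u8, u8);

/// The six sequences printed for U+0000..=U+FFFF, transcribed by hand.
const BMP_SEQUENCES: &[&[ByteRange]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
];

/// Returns the index of the sequence that matches `bytes` exactly, if any.
/// A sequence matches when it has the same length as `bytes` and every byte
/// falls inside the range at the same position.
fn matching_sequence(bytes: &[u8]) -> Option<usize> {
    BMP_SEQUENCES.iter().position(|seq| {
        seq.len() == bytes.len()
            && seq.iter().zip(bytes).all(|(&(lo, hi), &b)| lo <= b && b <= hi)
    })
}

fn main() {
    // 'a' is one byte, 'é' is two bytes, '€' is three bytes in UTF-8.
    for s in ["a", "é", "€"] {
        println!("{:?} -> {:?}", s, matching_sequence(s.as_bytes()));
    }
    // ED A0 80 would be an encoding of the surrogate U+D800; no sequence accepts it.
    println!("{:?}", matching_sequence(&[0xED, 0xA0, 0x80]));
}
```

Because the first-byte ranges of the six sequences do not overlap, at most one sequence can match any given input, and inputs such as the overlong encoding `C0 80` or the surrogate encoding `ED A0 80` match none.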