mirror of
https://gitee.com/openharmony/third_party_rust_regex
synced 2025-04-09 14:11:38 +00:00

In other words, `\b` in a `bytes::Regex` can now be used in the DFA. This leads to a big performance boost:

```
sherlock::word_ending_n 115,465,261 (5 MB/s) 3,038,621 (195 MB/s) -112,426,640 -97.37%
```

Unfortunately, Unicode word boundaries continue to elude the DFA. This state of affairs is lamentable, but after a lot of thought, I've concluded there are only two ways to speed up Unicode word boundaries:

1. Come up with a harebrained scheme to add multi-byte look-behind/ahead to the lazy DFA. (The theory says it's possible. Figuring out how to do this without combinatorial state explosion is not within my grasp at the moment.)
2. Build a second lazy DFA with transitions on Unicode codepoints instead of bytes. (The looming inevitability of this makes me queasy for a number of reasons.)

To ameliorate this state of affairs, it is now possible to disable Unicode support in `Regex::new` with `(?-u)`. In other words, one can now use an ASCII word boundary with `(?-u:\b)`.

Disabling Unicode support does not violate any invariants around UTF-8. In particular, if the regular expression could lead to a match of invalid UTF-8, then the parser will return an error. (This only happens for `Regex::new`. `bytes::Regex::new` still, of course, allows matching arbitrary bytes.)

Finally, a new `PERFORMANCE.md` guide was written.
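To make the difference between the two kinds of word boundary concrete, here is a minimal std-only sketch. It is not the code in this crate. The function names are hypothetical, and the Unicode word test is only an approximation (`char::is_alphanumeric` plus `_`, which is narrower than the real Unicode definition of `\w`). The point is that the ASCII check inspects exactly one byte on each side of a position, which a byte-based DFA can track, while the Unicode check has to decode a whole codepoint (up to four bytes) on each side.

```rust
/// ASCII word byte: `[0-9A-Za-z_]`.
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// ASCII word boundary at byte offset `at` (requires `at <= haystack.len()`).
/// Only one byte of look-behind and one byte of look-ahead are needed.
fn is_ascii_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_ascii_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_ascii_word_byte(haystack[at]);
    before != after
}

/// ASSUMPTION: a rough stand-in for Unicode `\w`, for illustration only.
fn is_approx_unicode_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Approximate Unicode word boundary at byte offset `at`.
/// `at` must lie on a UTF-8 character boundary. Each side requires
/// decoding a full codepoint, i.e. multi-byte look-behind/ahead.
fn is_approx_unicode_word_boundary(haystack: &str, at: usize) -> bool {
    let before = haystack[..at]
        .chars()
        .next_back()
        .map_or(false, is_approx_unicode_word_char);
    let after = haystack[at..]
        .chars()
        .next()
        .map_or(false, is_approx_unicode_word_char);
    before != after
}

fn main() {
    // U+00E9 (e with acute accent) is two bytes in UTF-8, at offsets 3 and 4.
    let text = "caf\u{e9} au lait";

    // Offset 3 sits between 'f' and the accented letter.
    println!("ascii   @3: {}", is_ascii_word_boundary(text.as_bytes(), 3));
    println!("unicode @3: {}", is_approx_unicode_word_boundary(text, 3));

    // Offset 5 sits between the accented letter and the space.
    println!("ascii   @5: {}", is_ascii_word_boundary(text.as_bytes(), 5));
    println!("unicode @5: {}", is_approx_unicode_word_boundary(text, 5));
}
```

The two definitions disagree at both offsets: the ASCII rule treats the bytes of the accented letter as non-word bytes, so it reports a boundary after `f` and none before the space, while the Unicode rule does the opposite. This is why switching to `(?-u:\b)` can change what a pattern matches on non-ASCII text, not just how fast it runs.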