mirror of
https://gitee.com/openharmony/third_party_rust_regex
synced 2025-04-08 21:54:20 +00:00

In other words, `\b` in a `bytes::Regex` can now be used in the DFA. This leads to a big performance boost:

```
sherlock::word_ending_n    115,465,261 (5 MB/s)    3,038,621 (195 MB/s)    -112,426,640    -97.37%
```

Unfortunately, Unicode word boundaries continue to elude the DFA. This state of affairs is lamentable, but after a lot of thought, I've concluded there are only two ways to speed up Unicode word boundaries:

1. Come up with a harebrained scheme to add multi-byte look-behind/ahead to the lazy DFA. (The theory says it's possible. Figuring out how to do this without combinatorial state explosion is not within my grasp at the moment.)
2. Build a second lazy DFA with transitions on Unicode codepoints instead of bytes. (The looming inevitability of this makes me queasy for a number of reasons.)

To ameliorate this state of affairs, it is now possible to disable Unicode support in `Regex::new` with `(?-u)`. In other words, one can now use an ASCII word boundary with `(?-u:\b)`.

Disabling Unicode support does not violate any invariants around UTF-8. In particular, if the regular expression could lead to a match of invalid UTF-8, then the parser will return an error. (This only happens for `Regex::new`. `bytes::Regex::new` still of course allows matching arbitrary bytes.)

Finally, a new `PERFORMANCE.md` guide was written.
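To illustrate why an ASCII word boundary is cheap for a byte-based matcher, here is a minimal sketch using only the standard library. It is not the crate's implementation: the helper names `is_ascii_word_byte` and `is_ascii_word_boundary` are hypothetical and chosen for this example. The point it shows is that the ASCII definition only needs one byte of look-behind and one byte of look-ahead, whereas a Unicode definition would have to decode a multi-byte codepoint on each side.

```rust
/// Returns true if `b` is an ASCII word byte, i.e. one of [0-9A-Za-z_].
/// (Hypothetical helper for illustration only.)
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Returns true if position `at` in `haystack` is an ASCII word boundary:
/// exactly one of the bytes immediately before and after `at` is a word byte.
/// Requires `at <= haystack.len()`. Only a single byte on each side is
/// inspected, so no UTF-8 decoding is needed.
fn is_ascii_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_ascii_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_ascii_word_byte(haystack[at]);
    before != after
}

fn main() {
    let hay = b"ab cd";
    // Collect every offset (including the end) that is a word boundary.
    let boundaries: Vec<usize> = (0..=hay.len())
        .filter(|&i| is_ascii_word_boundary(hay, i))
        .collect();
    println!("{:?}", boundaries); // prints [0, 2, 3, 5]

    // Under the ASCII definition, the two UTF-8 bytes of "é" are not word
    // bytes, so there is no boundary at the start of a lone "é".
    println!("{}", is_ascii_word_boundary("é".as_bytes(), 0)); // prints false
}
```

Under this definition a non-ASCII letter such as `é` counts as a non-word character, which is the trade-off one accepts when writing `(?-u:\b)` instead of `\b`.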
59 lines
1.2 KiB
Rust
// Copyright 2014-2015 The Rust Project Developers. See the COPYRIGHT
// file at the top-level directory of this distribution and at
// http://rust-lang.org/COPYRIGHT.
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.

extern crate rand;
extern crate regex;

macro_rules! regex_new {
    ($re:expr) => {{
        use regex::bytes::Regex;
        Regex::new($re)
    }}
}

macro_rules! regex_set_new {
    ($res:expr) => {{
        use regex::bytes::RegexSet;
        RegexSet::new($res)
    }}
}

macro_rules! regex {
    ($re:expr) => {
        regex_new!($re).unwrap()
    }
}

macro_rules! regex_set {
    ($res:expr) => {
        regex_set_new!($res).unwrap()
    }
}

// Must come before other module definitions.
include!("macros_bytes.rs");
include!("macros.rs");

mod api;
mod bytes;
mod crazy;
mod flags;
mod fowler;
mod multiline;
mod noparse;
mod regression;
mod replace;
mod set;
mod shortest_match;
mod suffix_reverse;
mod unicode;
mod word_boundary;
mod word_boundary_ascii;