third_party_rust_regex/tests/test_default_bytes.rs
Andrew Gallant ab72269b23 Add ASCII word boundaries to the lazy DFA.
In other words, `\b` in a `bytes::Regex` can now be used in the DFA.
This leads to a big performance boost:

```
sherlock::word_ending_n                  115,465,261 (5 MB/s)  3,038,621 (195 MB/s)    -112,426,640  -97.37%
```

Unfortunately, Unicode word boundaries continue to elude the DFA. This
state of affairs is lamentable, but after a lot of thought, I've
concluded there are only two ways to speed up Unicode word boundaries:

1. Come up with a hairbrained scheme to add multi-byte look-behind/ahead
   to the lazy DFA. (The theory says it's possible. Figuring out how to
   do this without combinatorial state explosion is not within my grasp
   at the moment.)
2. Build a second lazy DFA with transitions on Unicode codepoints
   instead of bytes. (The looming inevitability of this makes me queasy
   for a number of reasons.)

To ameliorate this state of affairs, it is now possible to disable
Unicode support in `Regex::new` with `(?-u)`. In other words, one can
now use an ASCII word boundary with `(?-u:\b)`.

Disabling Unicode support does not violate any invariants around UTF-8.
In particular, if the regular expression could lead to a match of
invalid UTF-8, then the parser will return an error. (This only happens
for `Regex::new`. `bytes::Regex::new` still of course allows matching
arbitrary bytes.)

Finally, a new `PERFORMANCE.md` guide was written.
2016-04-08 22:58:10 -04:00

59 lines
1.2 KiB
Rust

// Copyright 2014-2015 The Rust Project Developers. See the COPYRIGHT
// file at the top-level directory of this distribution and at
// http://rust-lang.org/COPYRIGHT.
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.
extern crate rand;
extern crate regex;
macro_rules! regex_new {
($re:expr) => {{
use regex::bytes::Regex;
Regex::new($re)
}}
}
macro_rules! regex_set_new {
($res:expr) => {{
use regex::bytes::RegexSet;
RegexSet::new($res)
}}
}
macro_rules! regex {
($re:expr) => {
regex_new!($re).unwrap()
}
}
macro_rules! regex_set {
($res:expr) => {
regex_set_new!($res).unwrap()
}
}
// Must come before other module definitions.
include!("macros_bytes.rs");
include!("macros.rs");
mod api;
mod bytes;
mod crazy;
mod flags;
mod fowler;
mod multiline;
mod noparse;
mod regression;
mod replace;
mod set;
mod shortest_match;
mod suffix_reverse;
mod unicode;
mod word_boundary;
mod word_boundary_ascii;