This commit enables support for the perf-literal feature. When it's
disabled, no literal optimizations will be performed. Instead, only
the regex engine itself is used.
In practice, it's quite plausible that we don't need to disable *all*
literal optimizations. But that is the simplest path here, and I don't
have the stomach to do anything more with the current code. src/exec.rs
has turned into a giant soup.
The issue with the ASCII version of \B is that it can match between code
units of UTF-8, which means it can cause match indices reported to be on
invalid UTF-8 boundaries. Therefore, similar to things like `(?-u:\xFF)`,
we ban negated ASCII word boundaries from Unicode regular expressions.
Normal ASCII word boundaries remain accessible from Unicode regular
expressions.
See #457
The escaping of &, - and ~ is only required when the characters are
repeated adjacently, which should be quite rare. Escaping of [ is always
required, unless it appear in the second position of a range.
These rules enable us to add character class sets as described in
UTS#18 RL1.3 in a backward compatible way.
For `[^\x00-\xff]`, while it is still treated as a full Unicode
character class, it is not empty. For instance `≥` would still be
matched.
However, when `CharClass::to_byte_class` is called on it (as is done
when using `regex::bytes::Regex::new` rather than `regex::Regex::new`),
it _is_ now empty, since it excludes all possible bytes.
This commit adds a test asserting that `regex::bytes::Regex::new`
returns `Err` for this case (in accordance with
https://github.com/rust-lang-nursery/regex/issues/106) and adds an
`is_empty` check to the result of calling `CharClass::to_byte_class`,
which allows the test to pass.
If a haystack was invalid UTF-8 (which is allowed to by searched using a
`bytes::Regex`), then ^/$ handling was incorrect. Namely, the ^/$
handling assumed that failing to decode a codepoint from the haystack
meant that the position was either at the beginning or end of the string
(which is true if the haystack is guaranteed to be valid UTF-8).
Instead, we should query the position directly instead of relying on the
encoding properties of the haystack.
Fixes#277.
This commit enables support for compiling regular expressions that can
match on arbitrary byte slices. In particular, we add a new sub-module
called `bytes` that duplicates the API of the top-level module, except
`&str` for subjects is replaced by `&[u8]`. Additionally, Unicode
support in the regular expression is disabled by default but can be
selectively re-enabled with the `u` flag. (Unicode support cannot be
selectively disabled in the standard top-level API.)
Most of the interesting changes occurred in the `regex-syntax` crate,
where the AST now explicitly distinguishes between "ASCII compatible"
expressions and Unicode aware expressions.
This PR makes a few other changes out of convenience:
1. The DFA now knows how to "give up" if it's flushing its cache too
often. When the DFA gives up, either backtracking or the NFA algorithm
take over, which provides better performance.
2. Benchmarks were added for Oniguruma.
3. The benchmarks in general were overhauled to be defined in one place
by using conditional compilation.
4. The tests have been completely reorganized to make it easier to split
up the tests depending on which regex engine we're using. For example,
we occasionally need to be able to write tests specifically for
`regex::Regex` or specifically for `regex::bytes::Regex`.
5. Fixes a bug where NUL bytes weren't represented correctly in the byte
class optimization for the DFA.
Closes#85.