For `[^\x00-\xff]`, while it is still treated as a full Unicode
character class, it is not empty. For instance `≥` would still be
matched.
However, when `CharClass::to_byte_class` is called on it (as is done
when using `regex::bytes::Regex::new` rather than `regex::Regex::new`),
it _is_ now empty, since it excludes all possible bytes.
This commit adds a test asserting that `regex::bytes::Regex::new`
returns `Err` for this case (in accordance with
https://github.com/rust-lang-nursery/regex/issues/106) and adds an
`is_empty` check to the result of calling `CharClass::to_byte_class`,
which allows the test to pass.
If a haystack was invalid UTF-8 (which is allowed to by searched using a
`bytes::Regex`), then ^/$ handling was incorrect. Namely, the ^/$
handling assumed that failing to decode a codepoint from the haystack
meant that the position was either at the beginning or end of the string
(which is true if the haystack is guaranteed to be valid UTF-8).
Instead, we should query the position directly instead of relying on the
encoding properties of the haystack.
Fixes#277.
This commit enables support for compiling regular expressions that can
match on arbitrary byte slices. In particular, we add a new sub-module
called `bytes` that duplicates the API of the top-level module, except
`&str` for subjects is replaced by `&[u8]`. Additionally, Unicode
support in the regular expression is disabled by default but can be
selectively re-enabled with the `u` flag. (Unicode support cannot be
selectively disabled in the standard top-level API.)
Most of the interesting changes occurred in the `regex-syntax` crate,
where the AST now explicitly distinguishes between "ASCII compatible"
expressions and Unicode aware expressions.
This PR makes a few other changes out of convenience:
1. The DFA now knows how to "give up" if it's flushing its cache too
often. When the DFA gives up, either backtracking or the NFA algorithm
take over, which provides better performance.
2. Benchmarks were added for Oniguruma.
3. The benchmarks in general were overhauled to be defined in one place
by using conditional compilation.
4. The tests have been completely reorganized to make it easier to split
up the tests depending on which regex engine we're using. For example,
we occasionally need to be able to write tests specifically for
`regex::Regex` or specifically for `regex::bytes::Regex`.
5. Fixes a bug where NUL bytes weren't represented correctly in the byte
class optimization for the DFA.
Closes#85.