Specifically, given the strings ABCX, CDAX and BCX, it was reporting
the unambiguous set as A, BCX and CDAX, which is wrong since A is a
substring of CDAX.
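For illustration, here is a minimal sketch of the invariant that was violated (the helper name is made up; this is not the crate's actual `unambiguous_prefixes` code): no literal in an unambiguous set may be a substring of another literal in the set.

```
fn is_unambiguous(lits: &[&str]) -> bool {
    // Every literal must not occur as a substring of any other literal.
    lits.iter().enumerate().all(|(i, a)| {
        lits.iter().enumerate().all(|(j, b)| i == j || !b.contains(*a))
    })
}

fn main() {
    // The buggy output for {ABCX, CDAX, BCX}: `A` is a substring of `CDAX`,
    // so the reported set is not actually unambiguous.
    assert!(!is_unambiguous(&["A", "BCX", "CDAX"]));
}
```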
`unambiguous_prefixes` is now quite a bit of a mess, but so is the rest
of the literal extraction code. The only thing it has going for it is
a massive test suite.
Fixes #289
In particular, if a range started or ended with `-`, then the pretty
printer would naively print `-` as if it were a normal character. For
example, `[--/]` corresponds to the class `[-/]` since the first `-` is
treated as a literal `-` and not part of any range.
There are a couple ways to fix this. For now, we fix the pretty printer
so that it never prints a `-` as part of a range.
This bug was found through ripgrep:
https://github.com/BurntSushi/ripgrep/issues/156
The specific problem here is that our literal search doesn't know about
anchors, so it will try to search all of the detected literals in a regex.
In a regex like `a|^b`, the literal `b` should only be searched for at the
beginning of the haystack and in no other place.
The right way to fix this is probably to make the literal detector smarter,
but the literal detector is already too complex. Instead, this commit
detects whether a regex is partially anchored (that is, when the regex has
at least one matchable sub-expression that is anchored), and if so,
disables the literal engine.
Note that this doesn't disable all literal optimizations, just the
optimization that opts out of regex engines entirely. Both the DFA and the
NFA will still use literal prefixes to search. Namely, if it searches and
finds a literal that needs to be anchored but isn't in the haystack, then
the regex engine rules it out as a false positive.
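As a sketch of the intended semantics (using the public API rather than the internal literal engine):

```
use regex::Regex;

fn main() {
    let re = Regex::new(r"a|^b").unwrap();
    // `^b` is anchored, so `b` may only match at the start of the haystack.
    assert!(re.is_match("b at the start"));
    // No `a` anywhere and `b` is not at the start, so this must not match,
    // even though the literal `b` occurs in the haystack.
    assert!(!re.is_match("zb"));
}
```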
Fixes #268.
This commit fixes a bug where matching `(?-u:\B)` (that is, "not an ASCII
word boundary") in the NFA engines could produce match positions at invalid
UTF-8 sequence boundaries. The specific problem is that determining whether
`(?-u:\B)` matches or not relies on knowing whether we must report matches
only at UTF-8 boundaries, and this wasn't actually being taken into
account. (Instead, we prefer to enforce this invariant in the compiler, so
that the matching engines mostly don't have to care about it.) But of
course, the zero-width assertions are kind of a special case all around,
so we need to handle ASCII word boundaries differently depending on
whether we require valid UTF-8.
This bug was noticed because the DFA actually handles this correctly (by
encoding ASCII word boundaries into the state machine itself, which in turn
guarantees the valid UTF-8 invariant) while the NFAs don't, leading to an
inconsistency.
Fixes #241.
When Unicode mode is disabled, we also disable the use of Unicode literals
in the regular expression, since it can lead to unintuitive behavior. In
this case, Unicode literals in character classes were not disallowed, and
subsequent code filtered them out, which resulted in an empty character
class. The compiler assumes that empty character classes are not allowed,
and so this causes an assert to trigger.
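A rough sketch of the fixed behavior, assuming the fix rejects such classes at parse time and that a pattern along these lines is what triggered the assert:

```
use regex::Regex;

fn main() {
    // With Unicode disabled, a Unicode literal inside a character class is
    // rejected at parse time instead of compiling to an empty class.
    assert!(Regex::new(r"(?-u)[☃]").is_err());
}
```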
Fixes #250.
The compiler in particular assumes that it never gets an empty character
class. The current parser is pretty paranoid about rejecting empty classes,
but a few tricky cases made it through. In particular, one can write
`[^\d\D]` to correspond to "match nothing." This commit now looks for
empty classes explicitly, and if one is found, returns an error.
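Concretely, a minimal sketch of the new behavior:

```
use regex::Regex;

fn main() {
    // An empty character class is now a parse error rather than an assert.
    assert!(Regex::new(r"[^\d\D]").is_err());
}
```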
Interestingly, other regex engines allow this particular idiosyncrasy and
interpret it as "never match." Even more interesting, expressions like
`a{0}` are also allowed (including by this regex library) and are
interpreted as "always match the empty string." Both seem semantically the
same. In any case, we forbid empty character classes, primarily because
that seems like the sensible thing to do but secondarily because it's the
conservative choice.
It seems plausible that such a construct could be occasionally useful if
one were machine generating regexes, because it could be used to indicate
"never match." If we do want to support that use case, we'll need to add
a new opcode to the regex matching engines. One can still achieve
that today using something like `(a|[^a])`.
Fixes #257, where using such a form caused an assert to trip in the
compiler. A new, more explicit assert has been added.
When popping the parse stack at the end of parsing, we weren't actually
checking that the stack was empty. If the stack isn't empty, then that
indicates an unclosed parenthesis.
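For example, a small sketch of the intended behavior:

```
use regex::Regex;

fn main() {
    // A dangling `(` leaves the parse stack non-empty, which is now reported
    // as an error instead of being silently accepted.
    assert!(Regex::new(r"(a|b").is_err());
}
```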
Fixes #239.
This uses the "Teddy" algorithm, as learned from the Hyperscan regular
expression library: https://01.org/hyperscan
This support is optional, subject to the following:
1. A nightly compiler.
2. Enabling the `simd-accel` feature.
3. Adding `RUSTFLAGS="-C target-feature=+ssse3"` when compiling.
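For reference, a build invocation satisfying all three requirements might look like this (a sketch, assuming the nightly toolchain is the active one; the exact command depends on how the crate is being built):

```
RUSTFLAGS="-C target-feature=+ssse3" cargo build --features simd-accel
```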
Previously, a byte based regex would always use byte based instructions.
This isn't actually necessary. If the regex has the Unicode flag enabled,
for example, then Unicode instructions can and should be used.
We can't mix and match though. If any part of the regex could match invalid
UTF-8, then we need to drop back to byte-based matching.
We could be more aggressive in the parser. In particular, we could check
if arbitrary character classes only match UTF-8, and if so, represent them
as Unicode classes instead of byte classes.
This optimization is only applicable to the NFA engines, since the DFA
operates exclusively on bytes. In particular, the NFA engines are much
faster with Unicode instructions.
In other words, `\b` in a `bytes::Regex` can now be used in the DFA.
This leads to a big performance boost:
```
sherlock::word_ending_n 115,465,261 (5 MB/s) 3,038,621 (195 MB/s) -112,426,640 -97.37%
```
Unfortunately, Unicode word boundaries continue to elude the DFA. This
state of affairs is lamentable, but after a lot of thought, I've
concluded there are only two ways to speed up Unicode word boundaries:
1. Come up with a harebrained scheme to add multi-byte look-behind/ahead
to the lazy DFA. (The theory says it's possible. Figuring out how to
do this without combinatorial state explosion is not within my grasp
at the moment.)
2. Build a second lazy DFA with transitions on Unicode codepoints
instead of bytes. (The looming inevitability of this makes me queasy
for a number of reasons.)
To ameliorate this state of affairs, it is now possible to disable
Unicode support in `Regex::new` with `(?-u)`. In other words, one can
now use an ASCII word boundary with `(?-u:\b)`.
Disabling Unicode support does not violate any invariants around UTF-8.
In particular, if the regular expression could lead to a match of
invalid UTF-8, then the parser will return an error. (This only happens
for `Regex::new`. `bytes::Regex::new` still of course allows matching
arbitrary bytes.)
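As a sketch of how these pieces fit together:

```
use regex::{bytes, Regex};

fn main() {
    // An ASCII word boundary is now expressible in `Regex::new`; it is
    // zero-width, so it cannot produce a match of invalid UTF-8.
    let re = Regex::new(r"(?-u:\b)fast(?-u:\b)").unwrap();
    assert!(re.is_match("a fast regex"));

    // An expression that could match invalid UTF-8 is still rejected by
    // `Regex::new`, but is fine for `bytes::Regex::new`.
    assert!(Regex::new(r"(?-u)\xFF").is_err());
    assert!(bytes::Regex::new(r"(?-u)\xFF").is_ok());
}
```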
Finally, a new `PERFORMANCE.md` guide was written.
The principal change in this commit is a complete rewrite of how
literals are detected from a regular expression. In particular, we now
traverse the abstract syntax to discover literals instead of the
compiled byte code. This permits more tuneable control over which and
how many literals are extracted, and is now exposed in the
`regex-syntax` crate so that others can benefit from it.
Other changes in this commit:
* The Boyer-Moore algorithm was rewritten to use my own concoction based
on frequency analysis. We end up regressing on a couple benchmarks
slightly because of this, but gain in some others and in general should
be faster in a broader number of cases. (Principally because we try to
run `memchr` on the rarest byte in a literal.) This should also greatly
improve handling of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix
literals exist but no prefix literals exist, then we can quickly scan
for suffix matches and then run the DFA in reverse to find matches.
(I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool
(from the new `mempool` crate). This reduces some amount of constant
overhead and improves several benchmarks that either search short
haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it
could only contain 2 or more). The InvalidSet error variant is now
deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the
start state was always the first instruction, which is trivially
wrong for an expression like `^☃$`. This bug persisted because it
typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published
sub-crate. The CLI tool can answer various facts about regular
expressions, such as printing its AST, its compiled byte code or its
detected literals.
Closes #96, #188, #189.
This commit enables support for compiling regular expressions that can
match on arbitrary byte slices. In particular, we add a new sub-module
called `bytes` that duplicates the API of the top-level module, except
`&str` for subjects is replaced by `&[u8]`. Additionally, Unicode
support in the regular expression is disabled by default but can be
selectively re-enabled with the `u` flag. (Unicode support cannot be
selectively disabled in the standard top-level API.)
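A small sketch of the new module in use:

```
use regex::bytes::Regex;

fn main() {
    // The haystack is an arbitrary `&[u8]` and may contain invalid UTF-8.
    let re = Regex::new(r"foo").unwrap();
    assert!(re.is_match(b"\xFF\xFFfoo\xFF"));
}
```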
Most of the interesting changes occurred in the `regex-syntax` crate,
where the AST now explicitly distinguishes between "ASCII compatible"
expressions and Unicode aware expressions.
This PR makes a few other changes out of convenience:
1. The DFA now knows how to "give up" if it's flushing its cache too
often. When the DFA gives up, either backtracking or the NFA algorithm
takes over, which provides better performance.
2. Benchmarks were added for Oniguruma.
3. The benchmarks in general were overhauled to be defined in one place
by using conditional compilation.
4. The tests have been completely reorganized to make it easier to split
up the tests depending on which regex engine we're using. For example,
we occasionally need to be able to write tests specifically for
`regex::Regex` or specifically for `regex::bytes::Regex`.
5. Fixes a bug where NUL bytes weren't represented correctly in the byte
class optimization for the DFA.
Closes #85.
Regex sets permit matching multiple (possibly overlapping) regular
expressions in a single scan of the search text. This adds a few new
types, with `RegexSet` being the primary one.
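For example, a sketch of the new API:

```
use regex::RegexSet;

fn main() {
    // One scan of the haystack reports which of the regexes matched.
    let set = RegexSet::new(&[r"\w+", r"\d+", r"foo"]).unwrap();
    let matched: Vec<usize> = set.matches("foo 123").into_iter().collect();
    assert_eq!(matched, vec![0, 1, 2]);
}
```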
All matching engines support regex sets, including the lazy DFA.
This commit also refactors a lot of the code around handling captures
into a central `Search`, which now also includes a set of matches that
is used by regex sets to determine which regex has matched.
We also merged the `Program` and `Insts` type, which were split up when
adding the lazy DFA, but the code seemed more complicated because of it.
Closes #156.
A lazy DFA is much faster than executing an NFA because it doesn't
repeat the work of following epsilon transitions over and over.
Instead, it computes states during search and caches them for reuse. We
avoid exponential state blow up by bounding the cache in size. When the
DFA isn't powerful enough to fulfill the caller's request (e.g., return
sub-capture locations), it still runs to find the boundaries of the
match and then falls back to NFA execution on the matched region. The
lazy DFA can otherwise execute on every regular expression *except* for
regular expressions that contain word boundary assertions (`\b` or
`\B`). (They are tricky to implement in the lazy DFA because they are
Unicode aware and therefore require multi-byte look-behind/ahead.)
The implementation in this PR is based on the implementation in Google's
RE2 library.
Adding a lazy DFA was a substantial change and required several
modifications:
1. The compiler can now produce both Unicode based programs (still used by the
NFA engines) and byte based programs (required by the lazy DFA, but possible
to use in the NFA engines too). In byte based programs, UTF-8 decoding is
built into the automaton.
2. A new `Exec` type was introduced to implement the logic for compiling
and choosing the right engine to use on each search.
3. Prefix literal detection was rewritten to work on bytes.
4. Benchmarks were overhauled and new ones were added to more carefully
track the impact of various optimizations.
5. A new `HACKING.md` guide has been added that gives a high-level
design overview of this crate.
Other changes in this commit include:
1. Protection against stack overflows. All places that once required
recursion have now either acquired a bound or been converted to
using a stack on the heap.
2. Update the Aho-Corasick dependency, which includes `memchr2` and
`memchr3` optimizations.
3. Add PCRE benchmarks using the Rust `pcre` bindings.
Closes #66, #146.
In commit 56ea4a, char classes were changed so that case folding them
stored all possible variants in the class ranges. This makes it possible
to drastically simplify the compiler to the point where case folding flags
can be completely removed. This has two major implications for
performance:
1. Matching engines no longer need to do case folding on the input.
2. Since case folding is now part of the automata, literal prefix
optimizations are now automatically applied even to regexes with
(?i).
This makes several changes in the public API of regex-syntax. Namely,
the `casei` flag has been removed from the `CharClass` expression and
the corresponding `is_case_insensitive` method has been removed.
TL;DR - The combination of case folding, character classes and nested
negation is darn tricky.
The problem presented in #99 was related to how we're storing case folded
character classes. Namely, we only store the canonical representation
of each character (which means that when we match text, we must apply
case folding to the input). But when this representation is negated,
information is lost.
From #99, consider the negated class with a single range `x`. The class is
negated before applying case folding. The negated class includes `X`,
so that case folding includes both `X` and `x` even though the regex
in #99 is specifically trying to not match either `X` or `x`.
The solution is to apply case folding *after* negation. But given our
representation, this doesn't work. Namely, case folding the range `x`
yields `x` with a case insensitive flag set. Negating this class ends up
matching all characters sans `x`, which means it will match `X`.
So I've backtracked the representation to include *all* case folding
variants. This means we can negate case folded classes and get the
expected result. e.g., case folding the class `[x]` yields `[xX]`, and
negating `[xX]` gives the desired result for the regex in #99.
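A sketch of the expected behavior, assuming the class from #99 is essentially `(?i)[^x]`:

```
use regex::Regex;

fn main() {
    // Case folding is applied after negation, so the negated class excludes
    // both `x` and `X`.
    let re = Regex::new(r"(?i)[^x]").unwrap();
    assert!(!re.is_match("x"));
    assert!(!re.is_match("X"));
    assert!(re.is_match("y"));
}
```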
When `regex-syntax` is compiled under debug mode, case folding can
take a significant amount of time. This path is easily triggered by
using case insensitive regexes.
This commit speeds up the case folding process by skipping binary
searches, although it is still not optimal. It could probably benefit
from a fresh approach, but let's leave it alone for now.