201 Commits

Andrew Gallant
6300490803 Fix a bug in unambiguous prefixes.
Specifically, given the strings ABCX, CDAX and BCX, it was reporting
the unambiguous set as A, BCX and CDAX, which is wrong since A is a
substring of CDAX.

unambiguous_prefixes is now quite a bit of a mess, but so is the rest
of the literal extraction code. The only thing it has going for it is
a massive test suite.

Fixes #289
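The invariant this fix restores can be sketched in a few lines (a hypothetical check, not the crate's actual `unambiguous_prefixes` code): no member of an unambiguous set may be a substring of another member.

```rust
/// Hypothetical check for the invariant: no literal in an
/// "unambiguous" set may be a substring of another literal.
fn is_unambiguous(lits: &[&str]) -> bool {
    for (i, a) in lits.iter().enumerate() {
        for (j, b) in lits.iter().enumerate() {
            // If any other literal contains `a`, the set is ambiguous.
            if i != j && b.contains(*a) {
                return false;
            }
        }
    }
    true
}

fn main() {
    // The buggy output described above: `A` is a substring of `CDAX`.
    assert!(!is_unambiguous(&["A", "BCX", "CDAX"]));
    // No member contains another, so this set is unambiguous.
    assert!(is_unambiguous(&["AB", "CD", "EF"]));
    println!("ok");
}
```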
2016-10-26 17:27:23 -04:00
Andrew Gallant
3cfef1e79d regex-syntax-0.3.7 2016-10-10 21:54:14 -04:00
Andrew Gallant
53e7ed5ff5 regex-syntax-0.3.6 2016-10-10 21:53:40 -04:00
Andrew Gallant
96d90f8b5b tag version in regex-syntax/Cargo.toml 2016-10-10 21:53:06 -04:00
Andrew Gallant
e5b4063815 Fix bug in expression printing round trip.
In particular, if a range started or ended with `-`, then the pretty
printer would naively print `-` as if it were a normal character. For
example, `[--/]` corresponds to the class `[-/]` since the first `-` is
treated as a literal `-` and not part of any range.

There are a couple ways to fix this. For now, we fix the pretty printer
so that it never prints a `-` as part of a range.

This bug was found through ripgrep:
https://github.com/BurntSushi/ripgrep/issues/156
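One way to keep the round trip safe (a sketch under the assumption that escaping `-` is acceptable output; the actual fix avoids printing `-` as a range endpoint) is to escape `-` wherever it appears, so it can never be misread as a range operator:

```rust
/// Escape characters that are meaningful inside a character class.
fn push_escaped(out: &mut String, c: char) {
    if matches!(c, '-' | ']' | '^' | '\\') {
        out.push('\\');
    }
    out.push(c);
}

/// Sketch of a round-trip-safe class printer (not the crate's code).
fn print_class(ranges: &[(char, char)]) -> String {
    let mut out = String::from("[");
    for &(start, end) in ranges {
        push_escaped(&mut out, start);
        if start != end {
            out.push('-');
            push_escaped(&mut out, end);
        }
    }
    out.push(']');
    out
}

fn main() {
    // The class containing a literal `-` and `/` prints unambiguously.
    assert_eq!(print_class(&[('-', '-'), ('/', '/')]), r"[\-/]");
    assert_eq!(print_class(&[('a', 'z')]), "[a-z]");
    println!("ok");
}
```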
2016-10-10 21:53:06 -04:00
aweinstock314
54048b6df3 Fix invariant in comment
`(end >= end)` is trivial, `(end >= start)` is likely intended.
2016-09-13 18:04:08 -04:00
Andrew Gallant
dc92a0bd4e Add a way to trim suffixes from literal sets. 2016-09-11 16:01:05 -04:00
Andrew Gallant
1488bb8ea3 regex-syntax 0.3.5 2016-09-04 09:31:39 -04:00
Andrew Gallant
e3449685f2 Add 'u' flag to error message. 2016-09-04 09:28:44 -04:00
Andrew Gallant
225f8e190d Disable literal optimizations for partially anchored regexes.
The specific problem here is that our literal search doesn't know about
anchors, so it will try to search all of the detected literals in a regex.
In a regex like `a|^b`, the literal `b` should only be searched for at the
beginning of the haystack and in no other place.

The right way to fix this is probably to make the literal detector smarter,
but the literal detector is already too complex. Instead, this commit
detects whether a regex is partially anchored (that is, when the regex has
at least one matchable sub-expression that is anchored), and if so,
disables the literal engine.

Note that this doesn't disable all literal optimizations, just the
optimization that opts out of regex engines entirely. Both the DFA and the
NFA will still use literal prefixes to search. Namely, if it searches and
finds a literal that needs to be anchored but isn't in the haystack, then
the regex engine rules it out as a false positive.

Fixes #268.
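The "partially anchored" test can be illustrated with a hypothetical mini-AST (the names `Expr`, `is_partially_anchored`, etc. are made up for this sketch, not taken from the crate): a regex is partially anchored when some, but not all, of its alternates are anchored at the start.

```rust
/// Hypothetical mini-AST, just enough to express `a|^b`.
enum Expr {
    Literal(&'static str),
    StartAnchor(Box<Expr>),
    Alternate(Vec<Expr>),
}

fn is_anchored(e: &Expr) -> bool {
    matches!(e, Expr::StartAnchor(_))
}

/// True when at least one alternate is anchored but not all of them are.
fn is_partially_anchored(e: &Expr) -> bool {
    match e {
        Expr::Alternate(es) => {
            let n = es.iter().filter(|x| is_anchored(x)).count();
            n > 0 && n < es.len()
        }
        _ => false,
    }
}

fn main() {
    // a|^b: `b` may only match at the start of the haystack, so the
    // literal engine must be disabled for this regex.
    let re = Expr::Alternate(vec![
        Expr::Literal("a"),
        Expr::StartAnchor(Box::new(Expr::Literal("b"))),
    ]);
    assert!(is_partially_anchored(&re));
    println!("ok");
}
```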
2016-08-04 20:17:59 -04:00
Andrew Gallant
98b9da3b52 regex-syntax 0.3.4 2016-07-10 00:45:41 -04:00
Andrew Gallant
84a2bf5d73 Match (?-u:\B) correctly in the NFA engines when valid UTF-8 is required.
This commit fixes a bug where matching (?-u:\B) (that is, "not an ASCII
word boundary") in the NFA engines could produce match positions at invalid
UTF-8 sequence boundaries. The specific problem is that determining whether
(?-u:\B) matches or not relies on knowing whether we must report matches
only at UTF-8 boundaries, and this wasn't actually being taken into
account. (Instead, we prefer to enforce this invariant in the compiler, so
that the matching engines mostly don't have to care about it.) But of
course, the zero-width assertions are kind of a special case all around,
so we need to handle ASCII word boundaries differently depending on
whether we require valid UTF-8.

This bug was noticed because the DFA actually handles this correctly (by
encoding ASCII word boundaries into the state machine itself, which in turn
guarantees the valid UTF-8 invariant) while the NFAs don't, leading to an
inconsistency.

Fix #241.
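The check the NFA engines were missing amounts to the following (a sketch of the UTF-8 boundary rule, not the engines' actual code): a match position is only valid when it does not land in the middle of a UTF-8 sequence.

```rust
/// A position is a character boundary when it is at either end of the
/// haystack or the byte at it is not a UTF-8 continuation byte.
fn is_char_boundary(hay: &[u8], i: usize) -> bool {
    // Continuation bytes have the form 0b10xx_xxxx.
    i == 0 || i == hay.len() || (hay[i] & 0b1100_0000) != 0b1000_0000
}

fn main() {
    let hay = "☃".as_bytes(); // U+2603 encodes as [0xE2, 0x98, 0x83]
    assert!(is_char_boundary(hay, 0));
    assert!(!is_char_boundary(hay, 1)); // would split the snowman
    assert!(!is_char_boundary(hay, 2));
    assert!(is_char_boundary(hay, 3));
    println!("ok");
}
```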
2016-07-09 22:45:11 -04:00
Andrew Gallant
81297f09cf Disallow Unicode literals in character classes when Unicode is disabled.
When Unicode mode is disabled, we also disable the use of Unicode literals
in the regular expression, since it can lead to unintuitive behavior. In
this case, Unicode literals in character classes were not disallowed, and
subsequent code filtered them out, which resulted in an empty character
class. The compiler assumes that empty character classes are not allowed,
and so this causes an assert to trigger.

Fixes #250.
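A hypothetical validation pass mirroring the fix (the function and error strings are illustrative, not the parser's real API): with Unicode disabled, reject non-ASCII class literals up front rather than silently filtering them out, which could leave an empty class behind.

```rust
/// Reject non-ASCII class literals when Unicode mode is off, and reject
/// classes that end up empty. A sketch, not the crate's parser.
fn validate_class(literals: &[char], unicode: bool) -> Result<(), String> {
    if !unicode {
        if let Some(c) = literals.iter().find(|c| !c.is_ascii()) {
            return Err(format!("Unicode literal {:?} not allowed with (?-u)", c));
        }
    }
    if literals.is_empty() {
        return Err("empty character class".to_string());
    }
    Ok(())
}

fn main() {
    assert!(validate_class(&['a', 'b'], false).is_ok());
    assert!(validate_class(&['☃'], false).is_err()); // (?-u)[☃] is an error
    assert!(validate_class(&['☃'], true).is_ok());
    println!("ok");
}
```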
2016-07-09 19:07:24 -04:00
Andrew Gallant
9062f38eff Disallow empty character class ranges.
The compiler in particular assumes that it never gets an empty character
class. The current parser is pretty paranoid about rejecting empty classes,
but a few tricky cases made it through. In particular, one can write
`[^\d\D]` to correspond to "match nothing." This commit now looks for
empty classes explicitly, and if one is found, returns an error.

Interestingly, other regex engines allow this particular idiosyncrasy and
interpret it as "never match." Even more interesting, expressions like
`a{0}` are also allowed (including by this regex library) and are
interpreted as "always match the empty string." Both seem semantically the
same. In any case, we forbid empty character classes, primarily because
that seems like the sensible thing to do but secondarily because it's the
conservative choice.

It seems plausible that such a construct could be occasionally useful if
one were machine generating regexes, because it could be used to indicate
"never match." If we do want to support that use case, we'll need to add
a new opcode to the regex matching engines. One can still achieve
that today using something like `(a|[^a])`.

Fixes #257, where using such a form caused an assert to trip in the
compiler; a new, more explicit assert has been added.
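Why `[^\d\D]` is empty can be seen from range arithmetic (a simplified sketch over codepoint ranges, not the parser's representation): `\d ∪ \D` covers every codepoint, so its negation has no ranges left, and the compiler must reject it.

```rust
const MAX: u32 = 0x10FFFF; // last Unicode codepoint

/// Negate a sorted, non-overlapping list of codepoint ranges.
fn negate(ranges: &[(u32, u32)]) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    let mut next = 0u32;
    for &(start, end) in ranges {
        if start > next {
            out.push((next, start - 1));
        }
        next = end + 1;
    }
    if next <= MAX {
        out.push((next, MAX));
    }
    out
}

fn main() {
    // \d ∪ \D is the full codepoint range; negating it leaves nothing,
    // which is exactly the [^\d\D] case the commit rejects.
    assert!(negate(&[(0, MAX)]).is_empty());
    // Negating ASCII digits leaves everything around them.
    assert_eq!(negate(&[(0x30, 0x39)]), vec![(0, 0x2F), (0x3A, MAX)]);
    println!("ok");
}
```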
2016-07-09 16:52:58 -04:00
Andrew Gallant
ed4f046bb1 regex-syntax 0.3.3 2016-06-16 07:21:00 -04:00
Andrew Gallant
e84a22485c Fix bug where unclosed paren was allowed.
When popping the parse stack at the end of parsing, we weren't actually
checking that the stack was empty. If the stack isn't empty, then that
indicates an unclosed parenthesis.

Fixes #239.
2016-05-23 10:00:09 -04:00
Andrew Gallant
f4e9e15e58 regex-syntax 0.3.2 2016-05-20 06:36:10 -04:00
Andrew Gallant
f9af58c4ca Fixes #234.
It turns out that we weren't computing suffix literals correctly in all
cases. In particular, the bytes from a Unicode character were being
reversed.
2016-05-19 17:47:44 -04:00
Andrew Gallant
203c509df9 Add SIMD accelerated multiple pattern search.
This uses the "Teddy" algorithm, as learned from the Hyperscan regular
expression library: https://01.org/hyperscan

This support is optional, subject to the following:

1. A nightly compiler.
2. Enabling the `simd-accel` feature.
3. Adding `RUSTFLAGS="-C target-feature=+ssse3"` when compiling.
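The three requirements above can be combined into a single build invocation (a sketch; the feature name and target flag are as given in this commit, and `cargo +nightly` assumes a rustup-managed toolchain):

```shell
# Requires a nightly toolchain; SSSE3 must be enabled at compile time.
RUSTFLAGS="-C target-feature=+ssse3" cargo +nightly build --features simd-accel
```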
2016-05-18 10:48:13 -04:00
Andrew Gallant
0e84e194e1 Use Unicode instructions more aggressively.
Previously, a byte based regex would always use byte based instructions.
This isn't actually necessary. If the regex has the Unicode flag enabled,
for example, then Unicode instructions can and should be used.

We can't mix and match though. If any part of the regex could match invalid
UTF-8, then we need to drop back to byte-based matching.

We could be more aggressive in the parser. In particular, we could check
if arbitrary character classes only match UTF-8, and if so, represent them
as Unicode classes instead of byte classes.

This optimization is only applicable to the NFA engines, since the DFA
operates exclusively on bytes. In particular, the NFA engines are much
faster with Unicode instructions.
2016-05-01 12:45:16 -04:00
Andrew Gallant
ab72269b23 Add ASCII word boundaries to the lazy DFA.
In other words, `\b` in a `bytes::Regex` can now be used in the DFA.
This leads to a big performance boost:

```
sherlock::word_ending_n                  115,465,261 (5 MB/s)  3,038,621 (195 MB/s)    -112,426,640  -97.37%
```

Unfortunately, Unicode word boundaries continue to elude the DFA. This
state of affairs is lamentable, but after a lot of thought, I've
concluded there are only two ways to speed up Unicode word boundaries:

1. Come up with a harebrained scheme to add multi-byte look-behind/ahead
   to the lazy DFA. (The theory says it's possible. Figuring out how to
   do this without combinatorial state explosion is not within my grasp
   at the moment.)
2. Build a second lazy DFA with transitions on Unicode codepoints
   instead of bytes. (The looming inevitability of this makes me queasy
   for a number of reasons.)

To ameliorate this state of affairs, it is now possible to disable
Unicode support in `Regex::new` with `(?-u)`. In other words, one can
now use an ASCII word boundary with `(?-u:\b)`.

Disabling Unicode support does not violate any invariants around UTF-8.
In particular, if the regular expression could lead to a match of
invalid UTF-8, then the parser will return an error. (This only happens
for `Regex::new`. `bytes::Regex::new` still of course allows matching
arbitrary bytes.)

Finally, a new `PERFORMANCE.md` guide was written.
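The semantics of the ASCII `\b` that the DFA can now handle are simple to state (a sketch of the boundary rule, not the DFA's state-machine encoding): a boundary exists wherever exactly one side of the position is an ASCII word byte.

```rust
/// ASCII word bytes are [0-9A-Za-z_].
fn is_word_byte(b: u8) -> bool {
    b == b'_' || b.is_ascii_alphanumeric()
}

/// (?-u:\b): exactly one side of position `i` is a word byte.
fn is_ascii_word_boundary(hay: &[u8], i: usize) -> bool {
    let before = i > 0 && is_word_byte(hay[i - 1]);
    let after = i < hay.len() && is_word_byte(hay[i]);
    before != after
}

fn main() {
    let hay = b"foo bar";
    assert!(is_ascii_word_boundary(hay, 0)); // before 'f'
    assert!(is_ascii_word_boundary(hay, 3)); // between 'o' and ' '
    assert!(!is_ascii_word_boundary(hay, 5)); // inside "bar"
    println!("ok");
}
```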
2016-04-08 22:58:10 -04:00
Andrew Gallant
1912d34c9c Always use Unicode flag when printing an expression. 2016-03-30 16:14:50 -04:00
Andrew Gallant
df6614fb1a Don't overrun the literal when adding it. 2016-03-29 12:48:25 -04:00
Andrew Gallant
6e0e01d085 regex-syntax 0.3.1 2016-03-28 16:32:36 -04:00
Andrew Gallant
31a317eadd Major literal optimization refactoring.
The principle change in this commit is a complete rewrite of how
literals are detected from a regular expression. In particular, we now
traverse the abstract syntax to discover literals instead of the
compiled byte code. This permits more tuneable control over which and
how many literals are extracted, and is now exposed in the
`regex-syntax` crate so that others can benefit from it.

Other changes in this commit:

* The Boyer-Moore algorithm was rewritten to use my own concoction based
  on frequency analysis. We end up regressing on a couple benchmarks
  slightly because of this, but gain in some others and in general should
  be faster in a broader number of cases. (Principally because we try to
  run `memchr` on the rarest byte in a literal.) This should also greatly
  improve handling of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix
  literals exist but no prefix literals exist, then we can quickly scan
  for suffix matches and then run the DFA in reverse to find matches.
  (I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool
  (from the new `mempool` crate). This reduces some amount of constant
  overhead and improves several benchmarks that either search short
  haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it
  could only contain 2 or more). The InvalidSet error variant is now
  deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the
  start state was always the first instruction, which is trivially
  wrong for an expression like `^☃$`. This bug persisted because it
  typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published
  sub-crate. The CLI tool can answer various facts about regular
  expressions, such as printing its AST, its compiled byte code or its
  detected literals.

Closes #96, #188, #189
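The frequency-analysis idea behind the rewritten substring search can be sketched as follows (the ranking table here is a made-up placeholder; the crate's real table and search code differ): run `memchr` on the rarest byte of the literal, not necessarily its first byte.

```rust
/// Return the offset of the literal's rarest byte according to `rank`,
/// where a lower rank means rarer. Assumes a non-empty literal.
fn rarest_byte_offset(lit: &[u8], rank: &[u8; 256]) -> usize {
    let mut best = 0;
    for i in 1..lit.len() {
        if rank[lit[i] as usize] < rank[lit[best] as usize] {
            best = i;
        }
    }
    best
}

fn main() {
    // Pretend `z` is much rarer than `e` or `t`.
    let mut rank = [128u8; 256];
    rank[b'e' as usize] = 255;
    rank[b't' as usize] = 200;
    rank[b'z' as usize] = 1;
    // For the literal "tez", scan for `z` and verify around each hit.
    assert_eq!(rarest_byte_offset(b"tez", &rank), 2);
    println!("ok");
}
```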
2016-03-27 20:07:46 -04:00
Andrew Gallant
28c0b0d8b8 regex-syntax 0.3.0 2016-03-13 11:09:19 -04:00
Andrew Gallant
d98ec1b1a5 Add regex matching for &[u8].
This commit enables support for compiling regular expressions that can
match on arbitrary byte slices. In particular, we add a new sub-module
called `bytes` that duplicates the API of the top-level module, except
`&str` for subjects is replaced by `&[u8]`. Additionally, Unicode
support in the regular expression is disabled by default but can be
selectively re-enabled with the `u` flag. (Unicode support cannot be
selectively disabled in the standard top-level API.)

Most of the interesting changes occurred in the `regex-syntax` crate,
where the AST now explicitly distinguishes between "ASCII compatible"
expressions and Unicode aware expressions.

This PR makes a few other changes out of convenience:

1. The DFA now knows how to "give up" if it's flushing its cache too
often. When the DFA gives up, either backtracking or the NFA algorithm
take over, which provides better performance.
2. Benchmarks were added for Oniguruma.
3. The benchmarks in general were overhauled to be defined in one place
by using conditional compilation.
4. The tests have been completely reorganized to make it easier to split
up the tests depending on which regex engine we're using. For example,
we occasionally need to be able to write tests specifically for
`regex::Regex` or specifically for `regex::bytes::Regex`.
5. Fixes a bug where NUL bytes weren't represented correctly in the byte
class optimization for the DFA.

Closes #85.
2016-03-09 21:23:29 -05:00
Andrew Gallant
99b32a3418 regex-syntax 0.2.5 2016-02-23 06:46:10 -05:00
Andrew Gallant
9967e07420 Add ExprBuilder, which can set the default values of flags when parsing.
Closes #172.
2016-02-22 19:44:21 -05:00
Andrew Gallant
f6a3e8977e regex-syntax 0.2.4 2016-02-22 07:05:20 -05:00
Andrew Gallant
94d0ad486e Add regex sets.
Regex sets permit matching multiple (possibly overlapping) regular
expressions in a single scan of the search text. This adds a few new
types, with `RegexSet` being the primary one.

All matching engines support regex sets, including the lazy DFA.

This commit also refactors a lot of the code around handling captures
into a central `Search`, which now also includes a set of matches that
is used by regex sets to determine which regex has matched.

We also merged the `Program` and `Insts` types, which were split up when
adding the lazy DFA, but the code seemed more complicated because of it.

Closes #156.
2016-02-21 20:52:07 -05:00
Andrew Gallant
640bfa762d Update old documentation for #172. 2016-02-21 06:36:44 -05:00
Andrew Gallant
d0ed5f1c22 regex-syntax 0.2.3 2016-02-15 16:27:09 -05:00
Andrew Gallant
2aa172779e Add a lazy DFA.
A lazy DFA is much faster than executing an NFA because it doesn't
repeat the work of following epsilon transitions over and over.
Instead, it computes states during search and caches them for reuse. We
avoid exponential state blow up by bounding the cache in size. When the
DFA isn't powerful enough to fulfill the caller's request (e.g., return
sub-capture locations), it still runs to find the boundaries of the
match and then falls back to NFA execution on the matched region. The
lazy DFA can otherwise execute on every regular expression *except* for
regular expressions that contain word boundary assertions (`\b` or
`\B`). (They are tricky to implement in the lazy DFA because they are
Unicode aware and therefore require multi-byte look-behind/ahead.)
The implementation in this PR is based on the implementation in Google's
RE2 library.

Adding a lazy DFA was a substantial change and required several
modifications:

1. The compiler can now produce both Unicode based programs (still used by the
   NFA engines) and byte based programs (required by the lazy DFA, but possible
   to use in the NFA engines too). In byte based programs, UTF-8 decoding is
   built into the automaton.
2. A new `Exec` type was introduced to implement the logic for compiling
   and choosing the right engine to use on each search.
3. Prefix literal detection was rewritten to work on bytes.
4. Benchmarks were overhauled and new ones were added to more carefully
   track the impact of various optimizations.
5. A new `HACKING.md` guide has been added that gives a high-level
   design overview of this crate.

Other changes in this commit include:

1. Protection against stack overflows. All places that once required
   recursion have now either acquired a bound or have been converted to
   using a stack on the heap.
2. Update the Aho-Corasick dependency, which includes `memchr2` and
   `memchr3` optimizations.
3. Add PCRE benchmarks using the Rust `pcre` bindings.

Closes #66, #146.
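The core caching idea can be shown with a toy (this hand-built NFA for `ab*c` and the `LazyDfa` type are illustrative only; the real DFA also builds UTF-8 decoding into the automaton and bounds its cache): each DFA state is a set of NFA states, computed on demand and never recomputed for the same set/byte pair.

```rust
use std::collections::HashMap;

type State = usize;

/// Toy NFA for the regex `ab*c`:
/// 0 --a--> 1, 1 --b--> 1, 1 --c--> 2 (2 is the match state).
fn nfa_step(s: State, b: u8) -> Vec<State> {
    match (s, b) {
        (0, b'a') => vec![1],
        (1, b'b') => vec![1],
        (1, b'c') => vec![2],
        _ => vec![],
    }
}

struct LazyDfa {
    // Memoize transitions: (set of NFA states, input byte) -> next set.
    cache: HashMap<(Vec<State>, u8), Vec<State>>,
}

impl LazyDfa {
    fn new() -> Self {
        LazyDfa { cache: HashMap::new() }
    }

    fn transition(&mut self, set: &[State], b: u8) -> Vec<State> {
        let key = (set.to_vec(), b);
        if let Some(next) = self.cache.get(&key) {
            return next.clone(); // already computed: no repeated work
        }
        let mut next: Vec<State> =
            set.iter().flat_map(|&s| nfa_step(s, b)).collect();
        next.sort_unstable();
        next.dedup();
        self.cache.insert(key, next.clone());
        next
    }

    fn is_match(&mut self, input: &[u8]) -> bool {
        let mut cur = vec![0];
        for &b in input {
            cur = self.transition(&cur, b);
        }
        cur.contains(&2)
    }
}

fn main() {
    let mut dfa = LazyDfa::new();
    assert!(dfa.is_match(b"abbbc"));
    assert!(dfa.is_match(b"ac"));
    assert!(!dfa.is_match(b"ab"));
    println!("ok");
}
```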
2016-02-15 15:42:04 -05:00
Andrew Gallant
a21e294c86 Fix a bug in negating a character class.
If the character class is empty and it is negated, then it should
yield all of Unicode. Previously, it was returning an empty class.
2016-01-30 22:51:11 -05:00
Andrew Gallant
2065b416e9 Fix doc bug reported by insaneinside on IRC. 2016-01-07 16:58:15 -05:00
Andrew Gallant
971e15d3db Fix doc url. 2015-09-09 22:40:29 -04:00
Sean
c6b93ce8c7 Simplify comment handling in the parser. 2015-08-05 08:38:27 +01:00
Andrew Gallant
b6d339ec63 version bump 2015-07-21 18:36:13 -04:00
Andrew Gallant
863677f095 Fix #101.
The order in which the character class was defined was incorrect. D'oh.
2015-07-21 18:35:15 -04:00
Andrew Gallant
d385028ed4 version bumps. 2015-07-05 13:17:05 -04:00
Andrew Gallant
cedfc8db51 Re-work case insensitive matching.
In commit 56ea4a, char classes were changed so that case folding them
stored all possible variants in the class ranges. This makes it possible
to drastically simplify the compiler to the point where case folding flags
can be completely removed. This has two major implications for
performance:

  1. Matching engines no longer need to do case folding on the input.
  2. Since case folding is now part of the automata, literal prefix
     optimizations are now automatically applied even to regexes with
     (?i).

This makes several changes in the public API of regex-syntax. Namely,
the `casei` flag has been removed from the `CharClass` expression and
the corresponding `is_case_insensitive` method has been removed.
2015-07-05 13:13:41 -04:00
Andrew Gallant
fb5868fcc7 version bump 2015-07-05 11:47:31 -04:00
Andrew Gallant
56ea4a835c Fixes #99.
TL;DR - The combination of case folding, character classes and nested
negation is darn tricky.

The problem presented in #99 was related to how we're storing case folded
character classes. Namely, we only store the canonical representation
of each character (which means that when we match text, we must apply
case folding to the input). But when this representation is negated,
information is lost.

From #99, consider the negated class with a single range `x`. The class is
negated before applying case folding. The negated class includes `X`,
so that case folding includes both `X` and `x` even though the regex
in #99 is specifically trying to not match either `X` or `x`.

The solution is to apply case folding *after* negation. But given our
representation, this doesn't work. Namely, case folding the range `x`
yields `x` with a case insensitive flag set. Negating this class ends up
matching all characters sans `x`, which means it will match `X`.

So I've backtracked the representation to include *all* case folding
variants. This means we can negate case folded classes and get the
expected result. e.g., case folding the class `[x]` yields `[xX]`, and
negating `[xX]` gives the desired result for the regex in #99.
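The fixed representation can be sketched with ASCII-only folding for brevity (the crate applies full Unicode case folding; these helper names are made up): because a folded class stores *all* case variants, negation sees them and excludes every one of them.

```rust
/// Fold a class by storing every case variant (ASCII-only sketch).
fn fold(chars: &[char]) -> Vec<char> {
    let mut out: Vec<char> = chars
        .iter()
        .flat_map(|&c| [c.to_ascii_lowercase(), c.to_ascii_uppercase()])
        .collect();
    out.sort_unstable();
    out.dedup();
    out
}

/// Negate a class over the ASCII range.
fn negate_ascii(chars: &[char]) -> Vec<char> {
    (0u8..128)
        .map(|b| b as char)
        .filter(|c| !chars.contains(c))
        .collect()
}

fn main() {
    // Folding [x] yields [xX] ...
    let folded = fold(&['x']);
    assert_eq!(folded, vec!['X', 'x']);
    // ... so the negation matches neither `x` nor `X`, as #99 expects.
    let negated = negate_ascii(&folded);
    assert!(!negated.contains(&'x') && !negated.contains(&'X'));
    println!("ok");
}
```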
2015-07-05 11:46:11 -04:00
Andrew Gallant
1e79c4d9ee regex-syntax: version bump 2015-06-02 18:16:43 -04:00
Andrew Gallant
f9fc8614d2 Optimize case folding.
When `regex-syntax` is compiled under debug mode, case folding can
take a significant amount of time. This path is easily triggered by
using case insensitive regexes.

This commit speeds up the case folding process by skipping binary
searches, although it is still not optimal. It could probably benefit
from a fresh approach, but let's leave it alone for now.
2015-06-02 18:16:04 -04:00
Andrew Gallant
7a72b1fc57 version bump.
Actually, I don't think I needed to bump `regex` proper. Whoops.
2015-05-28 19:14:55 -04:00
Pascal Hertleif
c427a3f4ff Adjust Some Formatting, Use checked_add More
Related to #88
2015-05-29 00:52:43 +02:00
Pascal Hertleif
13eb7bef5f Add '\#' Escaping
Fixes #88
2015-05-28 20:22:54 +02:00
Pascal Hertleif
349158ed27 [WIP] Treat '#' as Punctuation
Relates to #88
2015-05-28 18:31:06 +02:00