29 Commits

Author SHA1 Message Date
Andrew Gallant
df6614fb1a Don't overrun the literal when adding it. 2016-03-29 12:48:25 -04:00
Andrew Gallant
6e0e01d085 regex-syntax 0.3.1 2016-03-28 16:32:36 -04:00
Andrew Gallant
31a317eadd Major literal optimization refactoring.
The principal change in this commit is a complete rewrite of how
literals are detected from a regular expression. In particular, we now
traverse the abstract syntax to discover literals instead of the
compiled byte code. This permits more tuneable control over which and
how many literals are extracted, and is now exposed in the
`regex-syntax` crate so that others can benefit from it.

Other changes in this commit:

* The Boyer-Moore algorithm was rewritten to use my own concoction based
  on frequency analysis. We end up regressing on a couple benchmarks
  slightly because of this, but gain in some others and in general should
  be faster in a broader number of cases. (Principally because we try to
  run `memchr` on the rarest byte in a literal.) This should also greatly
  improve handling of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix
  literals exist but no prefix literals exist, then we can quickly scan
  for suffix matches and then run the DFA in reverse to find matches.
  (I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool
  (from the new `mempool` crate). This reduces some amount of constant
  overhead and improves several benchmarks that either search short
  haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it
  could only contain 2 or more). The InvalidSet error variant is now
  deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the
  start state was always the first instruction, which is trivially
  wrong for an expression like `^☃$`. This bug persisted because it
  typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published
  sub-crate. The CLI tool can report various facts about a regular
  expression, such as its AST, its compiled byte code or its
  detected literals.

Closes #96, #188, #189
2016-03-27 20:07:46 -04:00
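
The "run `memchr` on the rarest byte in a literal" idea from commit 31a317eadd can be sketched roughly as follows. This is a minimal illustration, not the crate's code: the `rank` table is a made-up stand-in for a real byte frequency table, `rarest_offset` and `find` are hypothetical names, and `Iterator::position` stands in for `memchr` so the sketch needs only the standard library.

```rust
/// Hypothetical frequency rank: a higher value means the byte is assumed
/// to be more common. A real table would come from measured corpus data.
fn rank(b: u8) -> u8 {
    match b {
        b' ' | b'e' | b't' | b'a' | b'o' | b'i' | b'n' => 255,
        b'a'..=b'z' => 200,
        b'A'..=b'Z' => 150,
        b'0'..=b'9' => 100,
        _ => 50,
    }
}

/// Offset of the (assumed) rarest byte in `needle`; ties pick the first.
fn rarest_offset(needle: &[u8]) -> usize {
    needle
        .iter()
        .enumerate()
        .min_by_key(|&(_, &b)| rank(b))
        .map(|(i, _)| i)
        .unwrap_or(0)
}

/// Find the first occurrence of `needle` by scanning for its rarest byte
/// and verifying the full literal at each candidate position.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    if needle.len() > haystack.len() {
        return None;
    }
    let off = rarest_offset(needle);
    let rare = needle[off];
    // Last haystack index at which the rare byte can sit in a valid match.
    let last = haystack.len() - needle.len() + off;
    let mut at = off;
    while at <= last {
        // Stand-in for memchr: scan only for the rare byte.
        let found = haystack[at..=last].iter().position(|&b| b == rare)?;
        let start = at + found - off;
        if &haystack[start..start + needle.len()] == needle {
            return Some(start);
        }
        at += found + 1;
    }
    None
}
```

For example, with this made-up table `find(b"say the Quick", b"the Q")` scans for `Q` rather than `t`, so the common bytes in the haystack are never treated as candidates.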
Andrew Gallant
28c0b0d8b8 regex-syntax 0.3.0 2016-03-13 11:09:19 -04:00
Andrew Gallant
d98ec1b1a5 Add regex matching for &[u8].
This commit enables support for compiling regular expressions that can
match on arbitrary byte slices. In particular, we add a new sub-module
called `bytes` that duplicates the API of the top-level module, except
`&str` for subjects is replaced by `&[u8]`. Additionally, Unicode
support in the regular expression is disabled by default but can be
selectively re-enabled with the `u` flag. (Unicode support cannot be
selectively disabled in the standard top-level API.)

Most of the interesting changes occurred in the `regex-syntax` crate,
where the AST now explicitly distinguishes between "ASCII compatible"
expressions and Unicode aware expressions.

This PR makes a few other changes out of convenience:

1. The DFA now knows how to "give up" if it's flushing its cache too
often. When the DFA gives up, either backtracking or the NFA algorithm
takes over, which provides better performance.
2. Benchmarks were added for Oniguruma.
3. The benchmarks in general were overhauled to be defined in one place
by using conditional compilation.
4. The tests have been completely reorganized to make it easier to split
up the tests depending on which regex engine we're using. For example,
we occasionally need to be able to write tests specifically for
`regex::Regex` or specifically for `regex::bytes::Regex`.
5. Fixes a bug where NUL bytes weren't represented correctly in the byte
class optimization for the DFA.

Closes #85.
2016-03-09 21:23:29 -05:00
Andrew Gallant
99b32a3418 regex-syntax 0.2.5 2016-02-23 06:46:10 -05:00
Andrew Gallant
9967e07420 Add ExprBuilder, which can set the default values of flags when parsing.
Closes #172.
2016-02-22 19:44:21 -05:00
Andrew Gallant
f6a3e8977e regex-syntax 0.2.4 2016-02-22 07:05:20 -05:00
Andrew Gallant
94d0ad486e Add regex sets.
Regex sets permit matching multiple (possibly overlapping) regular
expressions in a single scan of the search text. This adds a few new
types, with `RegexSet` being the primary one.

All matching engines support regex sets, including the lazy DFA.

This commit also refactors a lot of the code around handling captures
into a central `Search`, which now also includes a set of matches that
is used by regex sets to determine which regex has matched.

We also merged the `Program` and `Insts` types, which were split up when
adding the lazy DFA, but the code seemed more complicated because of it.

Closes #156.
2016-02-21 20:52:07 -05:00
Andrew Gallant
640bfa762d Update old documentation for #172. 2016-02-21 06:36:44 -05:00
Andrew Gallant
d0ed5f1c22 regex-syntax 0.2.3 2016-02-15 16:27:09 -05:00
Andrew Gallant
2aa172779e Add a lazy DFA.
A lazy DFA is much faster than executing an NFA because it doesn't
repeat the work of following epsilon transitions over and over.
Instead, it computes states during search and caches them for reuse. We
avoid exponential state blow up by bounding the cache in size. When the
DFA isn't powerful enough to fulfill the caller's request (e.g., return
sub-capture locations), it still runs to find the boundaries of the
match and then falls back to NFA execution on the matched region. The
lazy DFA can otherwise execute on every regular expression *except* for
regular expressions that contain word boundary assertions (`\b` or
`\B`). (They are tricky to implement in the lazy DFA because they are
Unicode aware and therefore require multi-byte look-behind/ahead.)
The implementation in this PR is based on the implementation in Google's
RE2 library.

Adding a lazy DFA was a substantial change and required several
modifications:

1. The compiler can now produce both Unicode based programs (still used by the
   NFA engines) and byte based programs (required by the lazy DFA, but possible
   to use in the NFA engines too). In byte based programs, UTF-8 decoding is
   built into the automaton.
2. A new `Exec` type was introduced to implement the logic for compiling
   and choosing the right engine to use on each search.
3. Prefix literal detection was rewritten to work on bytes.
4. Benchmarks were overhauled and new ones were added to more carefully
   track the impact of various optimizations.
5. A new `HACKING.md` guide has been added that gives a high-level
   design overview of this crate.

Other changes in this commit include:

1. Protection against stack overflows. All places that once required
   recursion have now either acquired a bound or have been converted to
   using a stack on the heap.
2. Update the Aho-Corasick dependency, which includes `memchr2` and
   `memchr3` optimizations.
3. Add PCRE benchmarks using the Rust `pcre` bindings.

Closes #66, #146.
2016-02-15 15:42:04 -05:00
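
The core idea in commit 2aa172779e (compute DFA states during the search, cache them, and bound the cache) can be sketched as below. This is a toy under stated assumptions, not the crate's implementation: `Nfa`, `LazyDfa`, `a_star_b` and all field names are hypothetical, the match is anchored at both ends, and the cache is simply wiped whenever it is full.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy NFA: `eps[s]` holds epsilon edges, `byte[s]` an optional byte edge.
struct Nfa {
    eps: Vec<Vec<usize>>,
    byte: Vec<Option<(u8, usize)>>,
    start: usize,
    accept: usize,
}

impl Nfa {
    /// Epsilon closure of `seed`, using an explicit stack on the heap.
    fn closure(&self, seed: &[usize]) -> BTreeSet<usize> {
        let mut set = BTreeSet::new();
        let mut stack = seed.to_vec();
        while let Some(s) = stack.pop() {
            if set.insert(s) {
                stack.extend(self.eps[s].iter().copied());
            }
        }
        set
    }
}

struct LazyDfa<'a> {
    nfa: &'a Nfa,
    states: Vec<BTreeSet<usize>>,         // DFA state id -> set of NFA states
    ids: HashMap<BTreeSet<usize>, usize>, // set of NFA states -> DFA state id
    trans: HashMap<(usize, u8), usize>,   // cached transitions
    max_states: usize,                    // cache bound
    flushes: usize,                       // how often the cache was wiped
}

impl<'a> LazyDfa<'a> {
    fn new(nfa: &'a Nfa, max_states: usize) -> Self {
        LazyDfa {
            nfa,
            states: Vec::new(),
            ids: HashMap::new(),
            trans: HashMap::new(),
            max_states: max_states.max(1),
            flushes: 0,
        }
    }

    /// Return the id for `set`, wiping the whole cache first if it is full.
    fn intern(&mut self, set: BTreeSet<usize>) -> usize {
        if let Some(&id) = self.ids.get(&set) {
            return id;
        }
        if self.states.len() >= self.max_states {
            self.states.clear();
            self.ids.clear();
            self.trans.clear();
            self.flushes += 1;
        }
        let id = self.states.len();
        self.states.push(set.clone());
        self.ids.insert(set, id);
        id
    }

    /// Follow one byte, computing and caching the target state on demand.
    fn step(&mut self, from: usize, b: u8) -> usize {
        if let Some(&to) = self.trans.get(&(from, b)) {
            return to;
        }
        let nfa = self.nfa;
        let targets: Vec<usize> = self.states[from]
            .iter()
            .filter_map(|&s| match nfa.byte[s] {
                Some((x, t)) if x == b => Some(t),
                _ => None,
            })
            .collect();
        let flushes_before = self.flushes;
        let to = self.intern(nfa.closure(&targets));
        // After a flush, `from` no longer names a cached state, so skip caching.
        if self.flushes == flushes_before {
            self.trans.insert((from, b), to);
        }
        to
    }

    /// Anchored match: the whole haystack must be consumed.
    fn is_match(&mut self, haystack: &[u8]) -> bool {
        let nfa = self.nfa;
        let mut cur = self.intern(nfa.closure(&[nfa.start]));
        for &b in haystack {
            cur = self.step(cur, b);
        }
        self.states[cur].contains(&nfa.accept)
    }
}

/// Hand-built NFA for the pattern `a*b`.
fn a_star_b() -> Nfa {
    Nfa {
        eps: vec![vec![1, 2], vec![], vec![], vec![]],
        byte: vec![None, Some((b'a', 0)), Some((b'b', 3)), None],
        start: 0,
        accept: 3,
    }
}
```

Constructing `LazyDfa::new(&nfa, 2)` forces flushes on even tiny inputs, which makes it easy to check that results stay the same when the cache is wiped. A counter like `flushes` is also the kind of signal a caller could use to decide when to give up on the DFA and fall back to another engine.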
Andrew Gallant
a21e294c86 Fix a bug in negating a character class.
If the character class is empty and it is negated, then it should
yield all of Unicode. Previously, it was returning an empty class.
2016-01-30 22:51:11 -05:00
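
The behaviour fixed in commit a21e294c86 can be illustrated with a small range-negation sketch. It is not the `regex-syntax` code: `negate_ranges` is a hypothetical helper that works on plain `u32` code points and, as a simplification, ignores the surrogate gap.

```rust
/// Largest Unicode code point.
const MAX: u32 = 0x10FFFF;

/// Negate a sorted, non-overlapping list of inclusive code point ranges.
/// An empty input yields one range covering everything.
fn negate_ranges(ranges: &[(u32, u32)]) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    // First code point not yet accounted for.
    let mut next = 0u32;
    for &(lo, hi) in ranges {
        if lo > next {
            out.push((next, lo - 1));
        }
        next = hi + 1;
    }
    if next <= MAX {
        out.push((next, MAX));
    }
    out
}
```

With this sketch, `negate_ranges(&[])` returns `[(0, 0x10FFFF)]`, matching the commit's statement that a negated empty class should yield all of Unicode.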
Andrew Gallant
2065b416e9 Fix doc bug reported by insaneinside on IRC. 2016-01-07 16:58:15 -05:00
Andrew Gallant
971e15d3db Fix doc url. 2015-09-09 22:40:29 -04:00
Sean
c6b93ce8c7 Simplify comment handling in the parser. 2015-08-05 08:38:27 +01:00
Andrew Gallant
b6d339ec63 version bump 2015-07-21 18:36:13 -04:00
Andrew Gallant
863677f095 Fix #101.
The order in which the character class was defined was incorrect. D'oh.
2015-07-21 18:35:15 -04:00
Andrew Gallant
d385028ed4 version bumps. 2015-07-05 13:17:05 -04:00
Andrew Gallant
cedfc8db51 Re-work case insensitive matching.
In commit 56ea4a, char classes were changed so that case folding them
stored all possible variants in the class ranges. This makes it possible
to drastically simplify the compiler to the point where case folding flags
can be completely removed. This has two major implications for
performance:

  1. Matching engines no longer need to do case folding on the input.
  2. Since case folding is now part of the automata, literal prefix
     optimizations are now automatically applied even to regexes with
     (?i).

This makes several changes in the public API of regex-syntax. Namely,
the `casei` flag has been removed from the `CharClass` expression and
the corresponding `is_case_insensitive` method has been removed.
2015-07-05 13:13:41 -04:00
Andrew Gallant
fb5868fcc7 version bump 2015-07-05 11:47:31 -04:00
Andrew Gallant
56ea4a835c Fixes #99.
TL;DR - The combination of case folding, character classes and nested
negation is darn tricky.

The problem presented in #99 was related to how we're storing case folded
character classes. Namely, we only store the canonical representation
of each character (which means that when we match text, we must apply
case folding to the input). But when this representation is negated,
information is lost.

From #99, consider the negated class with a single range `x`. The class is
negated before applying case folding. The negated class includes `X`,
so that case folding includes both `X` and `x` even though the regex
in #99 is specifically trying to not match either `X` or `x`.

The solution is to apply case folding *after* negation. But given our
representation, this doesn't work. Namely, case folding the range `x`
yields `x` with a case insensitive flag set. Negating this class ends up
matching all characters sans `x`, which means it will match `X`.

So I've backtracked the representation to include *all* case folding
variants. This means we can negate case folded classes and get the
expected result. e.g., case folding the class `[x]` yields `[xX]`, and
negating `[xX]` gives the desired result for the regex in #99.
2015-07-05 11:46:11 -04:00
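
The ordering problem described in commit 56ea4a835c can be reproduced with a toy model. The sketch below is restricted to ASCII and uses hypothetical helpers (`Class`, `negate_ascii`, `fold_ascii`); it only illustrates why storing all case variants and negating afterwards gives the expected class.

```rust
use std::collections::BTreeSet;

/// A toy character class: the set of ASCII bytes it matches.
type Class = BTreeSet<u8>;

/// Complement of the class within ASCII (0..128).
fn negate_ascii(c: &Class) -> Class {
    (0u8..128).filter(|b| !c.contains(b)).collect()
}

/// Add every ASCII case variant of each member to the class.
fn fold_ascii(c: &Class) -> Class {
    let mut out = c.clone();
    for &b in c {
        out.insert(b.to_ascii_lowercase());
        out.insert(b.to_ascii_uppercase());
    }
    out
}
```

Starting from the class `[x]`, `fold_ascii(&negate_ascii(&x))` contains both `x` and `X` (the behaviour reported in #99), while `negate_ascii(&fold_ascii(&x))` contains neither.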
Andrew Gallant
1e79c4d9ee regex-syntax: version bump 2015-06-02 18:16:43 -04:00
Andrew Gallant
f9fc8614d2 Optimize case folding.
When `regex-syntax` is compiled under debug mode, case folding can
take a significant amount of time. This path is easily triggered by
using case insensitive regexes.

This commit speeds up the case folding process by skipping binary
searches, although it is still not optimal. It could probably benefit
from a fresh approach, but let's leave it alone for now.
2015-06-02 18:16:04 -04:00
Andrew Gallant
7a72b1fc57 version bump.
Actually, I don't think I needed to bump `regex` proper. Whoops.
2015-05-28 19:14:55 -04:00
Pascal Hertleif
c427a3f4ff Adjust Some Formatting, Use checkadd More
Related to #88
2015-05-29 00:52:43 +02:00
Pascal Hertleif
13eb7bef5f Add '\#' Escaping
Fixes #88
2015-05-28 20:22:54 +02:00
Pascal Hertleif
349158ed27 [WIP] Treat '#' as Punctuation
Relates to #88
2015-05-28 18:31:06 +02:00
Andrew Gallant
6d5e909e5e Fixes from code review.
The big change here is the addition of a non-public variant in the
error enums. This will hint to users that one shouldn't exhaustively
match the enums in case new variants are added.
2015-05-27 18:43:28 -04:00
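
The "non-public variant" technique mentioned in commit 6d5e909e5e might look roughly like this. The enum, its variants and `describe` are hypothetical names chosen for illustration and are not taken from the crate. Since Rust 1.40 the `#[non_exhaustive]` attribute expresses the same intent directly.

```rust
/// Hypothetical error enum with a hidden variant that discourages
/// downstream code from matching exhaustively.
#[derive(Debug)]
pub enum ParseError {
    UnclosedParen,
    InvalidEscape(char),
    /// Not part of the public API; more variants may be added later.
    #[doc(hidden)]
    __Nonexhaustive,
}

/// Callers are expected to include a wildcard arm, so adding a new
/// variant later does not break their `match`.
pub fn describe(err: &ParseError) -> &'static str {
    match *err {
        ParseError::UnclosedParen => "unclosed parenthesis",
        ParseError::InvalidEscape(_) => "invalid escape sequence",
        _ => "unknown error",
    }
}
```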