795 Commits

Author SHA1 Message Date
Andrew Gallant
2b1fc2772d
regex-debug: add character count
This adds a total character count to the output of the utf8-ranges
sub-command.
2018-03-18 08:57:27 -04:00
Andrew Gallant
47d1aeeb89
0.2.10 2018-03-16 11:39:23 -04:00
Andrew Gallant
c84bc41e5a
unstable: update to latest std::arch
This replaces `is_target_feature_detected!` with
`is_x86_feature_detected!` and adds the `cfg_target_feature` required
for using said macro.
2018-03-15 12:32:41 -04:00
Andrew Gallant
dba7f3b041
regex-syntax-0.5.3 2018-03-13 21:44:49 -04:00
Andrew Gallant
97651fb604 syntax/hir: add a printer for HIR
This adds a printer for the high-level intermediate representation. The
regex it prints is valid, and can be used as a way to turn it into a
regex::Regex.
2018-03-13 21:44:08 -04:00
Andrew Gallant
c230e59468 syntax/hir: fix handling of ASCII word boundaries
Previously, we had some inconsistencies in how we were handling ASCII
word boundaries. In particular, the translator was accepting a negated
ASCII word boundary even if the caller didn't disable the UTF-8 invariant.
This is wrong, since a negated ASCII word boundary can match between any
two arbitrary bytes. However, fixing this is a breaking change, so for
now we document the bug. We plan to fix it with regex 1.0. See #457.

Additionally, we were incorrectly declaring that an ASCII word boundary
matched invalid UTF-8 via the Hir::is_always_utf8 property. An ASCII word
boundary must always match an ASCII byte on one side, which implies a
valid UTF-8 position.
2018-03-13 21:44:08 -04:00
Andrew Gallant
c7c7a43827 style: reword ast::print docs
Also, small formatting fix and removal of debugging test.
2018-03-13 21:44:08 -04:00
Andrew Gallant
37379b09dc
0.2.9 2018-03-12 22:36:49 -04:00
Andrew Gallant
3e87082374
changelog: 0.2.9 2018-03-12 22:36:30 -04:00
Andrew Gallant
27ed3fa9fa doc: note the new unstable feature 2018-03-12 22:32:53 -04:00
Andrew Gallant
04e2930206 ci: remove RUSTFLAGS, enable unstable
This removes our compile time SIMD flags and replaces them with the
`unstable` feature, which will cause CI to use whatever CPU features are
available.

Ideally, we would test each important CPU feature combination, but I'd
like to avoid doing that in one CI job and instead split them out into
separate CI jobs to keep CI times low. That requires more work.
2018-03-12 22:32:53 -04:00
Andrew Gallant
361459c27f bench: remove RUSTFLAGS
We no longer need to enable SIMD optimizations at compile time. They are
automatically enabled when regex is compiled with the `unstable`
feature.
2018-03-12 22:32:53 -04:00
Andrew Gallant
f962ddbff0 teddy: port teddy searcher to AVX2
This commit adds a copy of the Teddy searcher that works on AVX2. We
don't attempt to reuse any code between them just yet, and instead just
copy & paste and tweak parts of it to work on 32 bytes instead of 16.
(Some parts were trickier than others. For example, @jneem figured out
how to nearly compensate for the lack of a real 256-bit bytewise PALIGNR
instruction, which we borrow here.)

Overall, AVX2 provides a nice bump in performance.
2018-03-12 22:32:53 -04:00
Andrew Gallant
91296ddcc0 teddy: port teddy searcher to std::arch
This commit ports the Teddy searcher to use std::arch and moves off the
portable SIMD vector API. Performance remains the same, and it looks
like the codegen is identical, which is great!

This also makes the `simd-accel` feature a no-op and adds a new
`unstable` feature which will enable the Teddy optimization. The `-C
target-feature` or `-C target-cpu` settings are no longer necessary,
since this will now do runtime target feature detection.

We also add a new `unstable` feature to the regex crate, which will
enable this new use of std::arch. Once enabled, the Teddy optimizations
become available automatically without any additional compile time
flags.
2018-03-12 22:32:53 -04:00
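The runtime target feature detection mentioned in 91296ddcc0 can be sketched with the standard library's `is_x86_feature_detected!` macro. This is only an illustration: the function `pick_searcher` and the names it returns are hypothetical and are not taken from the regex crate.

```rust
/// Choose a searcher implementation at runtime based on the CPU we are
/// actually running on. The returned names are hypothetical labels.
fn pick_searcher() -> &'static str {
    // The detection macro only exists on x86 targets, so gate its use.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return "teddy-avx2";
        }
        if is_x86_feature_detected!("ssse3") {
            return "teddy-ssse3";
        }
    }
    // No usable SIMD feature (or not an x86 target at all).
    "fallback"
}
```

Because the check happens while the program runs, one binary can be shipped without `-C target-feature` or `-C target-cpu` and still use the widest vectors the host supports.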
Andrew Gallant
0baa9bf859 gitignore: add tmp dir 2018-03-12 22:32:53 -04:00
Andrew Gallant
a3c0510711
regex-syntax-0.5.2 2018-03-12 09:49:20 -04:00
Andrew Gallant
102458feff
syntax: fix trailing - bug
This fixes a bug in the parser where a regex like `(?x)[ / - ]` would
fail to parse. In particular, since whitespace insensitive mode is
enabled, this regex should be equivalent to `[/-]`, where the `-` is
treated as a literal `-` instead of a range since it is the last
character in the class. However, the parser did not account for
whitespace insensitive mode, so it didn't see the `-` in `(?x)[ / - ]`
as trailing, and therefore reported an unclosed character class (since
the `]` was treated as part of the range).

We fix that in this commit by accounting for whitespace insensitive
mode, which we do by adding a `peek` method that skips over whitespace.

Fixes #455
2018-03-12 09:27:02 -04:00
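A minimal sketch of the idea behind 102458feff, assuming a much simpler parser than the real one. The helpers `peek_after` and `dash_is_trailing_literal` are hypothetical and only show how a whitespace-skipping peek changes the decision about a trailing `-`.

```rust
/// Return the next character after byte offset `pos`, skipping whitespace
/// when `ignore_whitespace` (the `x` flag) is set. Hypothetical helper;
/// `pos` must lie on a character boundary.
fn peek_after(pattern: &str, pos: usize, ignore_whitespace: bool) -> Option<char> {
    pattern[pos..]
        .chars()
        .skip(1) // skip the character at `pos` itself
        .find(|c| !(ignore_whitespace && c.is_whitespace()))
}

/// A `-` inside a character class is a literal when the next significant
/// character closes the class.
fn dash_is_trailing_literal(pattern: &str, pos: usize, ignore_whitespace: bool) -> bool {
    peek_after(pattern, pos, ignore_whitespace) == Some(']')
}
```

With `(?x)[ / - ]`, a plain peek sees a space after the `-` and treats it as the start of a range, while the whitespace-aware peek sees `]` and treats the `-` as a literal.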
Andrew Gallant
3e370e4c6b
0.2.8 2018-03-12 08:19:53 -04:00
Andrew Gallant
b8b37e9ffb
deps: bump regex-syntax to 0.5.1 2018-03-12 08:19:44 -04:00
Andrew Gallant
8b374ed3e7
regex-syntax-0.5.1 2018-03-12 08:19:06 -04:00
Andrew Gallant
c3fa4a46cb
changelog 0.2.8 2018-03-12 08:18:32 -04:00
Andrew Gallant
0f32c0393a
regex/dfa: minor perf improvement
This commit improves the DFA's `follow_epsilons` routine slightly. In
particular, it eliminates a sizable chunk of stack operations by using
a normal linear loop. The only time we use the stack is for a Split
instruction, which is still admittedly quite common. However, as we improve
the byte code, many of the Split instructions should go away.

Note that this is the same technique used by the backtracking and PikeVM
engines.
2018-03-10 08:09:36 -05:00
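The loop shape described in 0f32c0393a can be sketched as follows. The instruction set here is a toy one invented for illustration, not the crate's byte code: epsilon transitions are followed in a plain loop, and the explicit stack is touched only when a `Split` is encountered.

```rust
/// Toy instructions (an assumption for this sketch).
enum Inst {
    Save(usize),         // epsilon transition to one target
    Split(usize, usize), // epsilon transition to two targets
    Char(char, usize),   // consumes input, so it ends the closure
    Match,
}

/// Collect the non-epsilon instructions reachable from `start`.
fn follow_epsilons(prog: &[Inst], start: usize) -> Vec<usize> {
    let mut seen = vec![false; prog.len()];
    let mut out = Vec::new();
    let mut stack = vec![start];
    while let Some(mut ip) = stack.pop() {
        // Follow single-target epsilons without any stack traffic.
        loop {
            if seen[ip] {
                break;
            }
            seen[ip] = true;
            match &prog[ip] {
                Inst::Save(next) => ip = *next,
                Inst::Split(a, b) => {
                    // Only a Split needs the stack: remember one branch
                    // and keep following the other.
                    stack.push(*b);
                    ip = *a;
                }
                Inst::Char(..) | Inst::Match => {
                    out.push(ip);
                    break;
                }
            }
        }
    }
    out
}
```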
Andrew Gallant
a89220dd71
regex-syntax: fix nest limit checker
This commit fixes an embarrassing bug where the depth in the nest limit
checker was never decremented during postorder traversal, which means
long but shallow regexes would incorrectly trip the nest limit. We fix
that in this commit and add two regression tests.

Fixes #454
2018-03-09 22:45:55 -05:00
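The bug fixed in a89220dd71 can be illustrated with a toy checker. The `Ast` type and `NestLimiter` below are assumptions made for this sketch, and the traversal is recursive for brevity. The key line is the decrement after a node's children have been visited.

```rust
/// A toy AST standing in for the real one.
enum Ast {
    Literal(char),
    Group(Box<Ast>),
    Concat(Vec<Ast>),
}

struct NestLimiter {
    limit: u32,
    depth: u32,
}

impl NestLimiter {
    fn check(&mut self, ast: &Ast) -> Result<(), String> {
        let children: &[Ast] = match ast {
            Ast::Literal(_) => return Ok(()),
            Ast::Group(inner) => std::slice::from_ref(&**inner),
            Ast::Concat(items) => items.as_slice(),
        };
        // Pre-order: entering a nested node increases the depth.
        self.depth += 1;
        if self.depth > self.limit {
            return Err(format!("nest limit {} exceeded", self.limit));
        }
        for child in children {
            self.check(child)?;
        }
        // Post-order: leaving the node must decrease the depth again.
        // Without this line, siblings accumulate depth as if nested.
        self.depth -= 1;
        Ok(())
    }
}

fn check_nest_limit(ast: &Ast, limit: u32) -> Result<(), String> {
    NestLimiter { limit, depth: 0 }.check(ast)
}
```

A concatenation of many sibling groups has depth 2 no matter how long it is, so it passes a limit of 2 only when the decrement is present.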
Andrew Gallant
649762db9b
regex: add nest_limit
This commit exposes the `nest_limit` option that regex-syntax provides.
The nest limit controls how deeply nested a regex is allowed to be.
2018-03-09 22:43:50 -05:00
ethanpailes
7f23152b23 doc: resync TBM should_use comment
The TBM `should_use` comment drifted slightly
out of sync with the code when a better usage heuristic
was added. I've shaved the yak.
2018-03-09 07:12:03 -05:00
Andrew Gallant
cbfc0a38de
0.2.7 2018-03-07 19:13:22 -05:00
Andrew Gallant
8aa479dac3
changelog: 0.2.7 2018-03-07 19:12:03 -05:00
Andrew Gallant
052176d67f
regex/literals: re-enable Tuned Boyer-Moore
We've added tests and carefully scrutinized it. Let's try this again.
2018-03-07 19:07:34 -05:00
Andrew Gallant
d756dba73e
tests: remove unused plugin tests 2018-03-07 19:06:06 -05:00
ethanpailes
c075e18c62 regex/literal: add quickcheck property for Boyer-Moore
If you just generate two random strings, the odds are very high
that the shorter one won't be a substring of the longer one once
they reach any substantial length. This means that the existing
quickcheck cases were probably just testing the negative cases.
The exception would be the two cases that append the needle
to the haystack, but those only test behavior at the ends. This
patch adds a better quickcheck case that can test a needle anywhere
in the haystack.

Fixes #446
2018-03-07 19:03:13 -05:00
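The property described in c075e18c62 can be sketched without the quickcheck crate. The names `naive_find` and `prop_finds_embedded_needle` are hypothetical. The haystack is built as prefix + needle + suffix, so a match is guaranteed to exist and may start anywhere, and the searcher under test is compared against a simple reference search.

```rust
/// Reference implementation: the leftmost occurrence by brute force.
fn naive_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Property: a searcher must agree with the reference on a haystack that
/// contains the needle by construction.
fn prop_finds_embedded_needle<F>(find: F, prefix: &[u8], needle: &[u8], suffix: &[u8]) -> bool
where
    F: Fn(&[u8], &[u8]) -> Option<usize>,
{
    let mut haystack = prefix.to_vec();
    haystack.extend_from_slice(needle);
    haystack.extend_from_slice(suffix);
    let expected = naive_find(&haystack, needle);
    expected.is_some() && find(&haystack, needle) == expected
}
```

In a real property test, `prefix`, `needle` and `suffix` would be generated randomly and `find` would be the Boyer-Moore searcher.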
Andrew Gallant
4ce111568b changelog: update for next release 2018-03-07 19:01:24 -05:00
Andrew Gallant
b3e5fd2dde regex: remove old regex-syntax crate
This commit does the mechanical changes necessary to remove the old
regex-syntax crate and replace it with the rewrite. The rewrite now
subsumes the `regex-syntax` crate name, and gets a semver bump to 0.5.0.
2018-03-07 19:01:24 -05:00
Andrew Gallant
efff9fa20e doc: update README 2018-03-07 19:01:24 -05:00
Andrew Gallant
f3b0c66347 regex: better formatting for syntax errors
This commit adds an explicit Debug impl for regex's main Error type.
The purpose of this impl is to format parse errors in normal panic
messages more nicely. This is slightly idiosyncratic, but the default
Debug impl prints the full string anyway, so we might as well format it
nicely.

See also: #450
2018-03-07 19:01:24 -05:00
Andrew Gallant
040a71f9d4 regex-debug: add utf8-ranges sub-command
This sub-command prints out the UTF-8 alternation machine for an
arbitrary character class.
2018-03-07 19:01:24 -05:00
Andrew Gallant
eb03ef11c8 doc: document Unicode support
This commit provides exhaustive documentation for the regex crate's support
for Level 1 ("Basic Unicode Support") as documented in UTS#18.

We also document the small number of additions added to the concrete
syntax as a result of the regex-syntax rewrite.

See: http://unicode.org/reports/tr18/
2018-03-07 19:01:24 -05:00
Andrew Gallant
b906fd55c5 tests: add Unicode general category tests 2018-03-07 19:01:24 -05:00
Andrew Gallant
ddcbf5b44d compile: ban empty sub-expressions
With the regex syntax rewrite, we now support empty subexpressions more
officially. Unfortunately, the compiler has trouble with empty
subexpressions in alternation branches. There's no particular reason not
to support them, but they are difficult/awkward to express with the
current compiler. So just ban them for now.

If one does need an empty subexpression in an alternate branch, then
amusingly, something like `()?|z` will work. We could rewrite all such
empty sub-expressions into `()?`, which would retain the same match
semantics, but we choose to take the most conservative change possible.
2018-03-07 19:01:24 -05:00
Andrew Gallant
4ae3ae9d92 regex: move to regex-syntax-2
This commit moves the entire regex crate over to the regex-syntax-2
rewrite. Most of this is just rewriting types.

The compiler got the most interesting set of changes. It got simpler
in some respects, but not significantly so.
2018-03-07 19:01:24 -05:00
Andrew Gallant
715a807289 syntax: rewrite the regex-syntax crate
This commit represents a ground up rewrite of the regex-syntax crate.
This commit is also an intermediate state. That is, it adds a new
regex-syntax-2 crate without making any serious changes to any other
code. Subsequent commits will cover the integration of the rewrite and
the removal of the old crate.

The rewrite is intended to be the first phase in an effort to overhaul
the entire regex crate. To that end, this rewrite takes steps in that
direction:

* The principal change in the public API is an explicit split between a
  regular expression's abstract syntax (AST) and a high-level
  intermediate representation (HIR) that is easier to analyze. The old
  version of this crate mixed these two concepts, but leaned heavily
  towards an HIR. The AST in the rewrite has a much closer
  correspondence with the concrete syntax than the old `Expr` type does.
  The new HIR embraces its role; all flags are now compiled away
  (including the `i` flag), which will simplify subsequent passes,
  including literal detection and the compiler. ASTs are produced by
  ast::parse and HIR is produced by hir::translate. A top-level parser
  is provided that combines these so that callers can skip straight from
  concrete syntax to HIR.
* Error messages are vastly improved thanks to the span information that
  is now embedded in the AST. In addition to better formatting, error
  messages now also include helpful hints when trying to use features
  that aren't supported (like backreferences and look-around). In
  particular, octal support is now an opt-in option. (Octal support
  will continue to be enabled in regex proper to support backwards
  compatibility, but will be disabled in 1.0.)
* More robust support for Unicode Level 1 as described in UTS#18.
  In particular, we now fully support Unicode character classes
  including set notation (difference, intersection, symmetric
  difference) and correct support for named general categories, scripts,
  script extensions and age. That is, `\p{scx:Hira}` and `\p{age:3.0}`
  now work. To make this work, we introduce an internal interval set
  data structure.
* With the exception of literal extraction (which will be overhauled in
  a later phase), all code in the rewrite uses constant stack space,
  even while performing analysis that requires structural induction over
  the AST or HIR. This is done by pushing the call stack onto the heap,
  and is abstracted by the `ast::Visitor` and `hir::Visitor` traits.
  The point of this method is to eliminate stack overflows in the
  general case.
* Empty sub-expressions are now properly supported. Expressions like
  `()`, `|`, `a|` and `b|()+` are now valid syntax.

The principal downsides of these changes are parse time and binary size.
Both seem to have increased (slower and bigger) by about 1.5x. Parse
time is generally peanuts compared to the compiler, so we mostly don't
care about that. Binary size is mildly unfortunate, and if it becomes a
serious issue, it should be possible to introduce a feature that
disables some level of Unicode support and/or work on compressing the
Unicode tables. Compile times have increased slightly, but are still a
very small fraction of the overall time it takes to compile `regex`.

Fixes #174, Fixes #424
2018-03-07 19:01:24 -05:00
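The interval set mentioned in 715a807289 is internal to the crate and is not shown in this log. The following is a rough sketch of the general idea, assuming inclusive `(lo, hi)` ranges with `lo <= hi`. The type `IntervalSet` and its methods are illustrative and do not reproduce the crate's own type.

```rust
/// Inclusive ranges of code points, kept sorted, non-overlapping and
/// non-adjacent so that each set has exactly one representation.
#[derive(Debug, Clone, PartialEq)]
struct IntervalSet {
    ranges: Vec<(u32, u32)>,
}

impl IntervalSet {
    fn new(mut ranges: Vec<(u32, u32)>) -> IntervalSet {
        ranges.sort();
        let mut merged: Vec<(u32, u32)> = Vec::new();
        for (lo, hi) in ranges {
            match merged.last_mut() {
                // Overlapping or adjacent: extend the previous range.
                Some(last) if lo <= last.1.saturating_add(1) => last.1 = last.1.max(hi),
                _ => merged.push((lo, hi)),
            }
        }
        IntervalSet { ranges: merged }
    }

    /// Set intersection by walking both sorted range lists in lockstep.
    fn intersect(&self, other: &IntervalSet) -> IntervalSet {
        let (mut i, mut j) = (0, 0);
        let mut out = Vec::new();
        while i < self.ranges.len() && j < other.ranges.len() {
            let (a, b) = (self.ranges[i], other.ranges[j]);
            let (lo, hi) = (a.0.max(b.0), a.1.min(b.1));
            if lo <= hi {
                out.push((lo, hi));
            }
            // Advance whichever range ends first.
            if a.1 < b.1 {
                i += 1;
            } else {
                j += 1;
            }
        }
        IntervalSet { ranges: out }
    }

    fn contains(&self, c: char) -> bool {
        let c = c as u32;
        self.ranges.iter().any(|&(lo, hi)| lo <= c && c <= hi)
    }
}
```

Difference and symmetric difference can be built in the same lockstep style, which is what makes set notation on character classes cheap to support.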
Matt Brubeck
7f020b8de0
regex: add Replacer::by_ref adaptor
This permits use of a Replacer without consuming it.

Note: This can't simply return `&mut Self` because a generic
`impl<R: Replacer> Replacer for &mut R` would conflict with libstd's
generic `impl<F: FnMut> FnMut for &mut F`.

See also: #83

Closes #449
2018-03-07 15:39:50 -05:00
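The adaptor pattern from 7f020b8de0 can be sketched with a simplified, self-contained trait. The `Replacer` trait below is a stand-in that takes a `&str` instead of the crate's capture type, so the signatures are assumptions. The wrapper struct avoids the conflicting blanket impl for `&mut R`.

```rust
/// Simplified stand-in for the crate's trait (signatures are assumptions).
trait Replacer {
    fn replace_append(&mut self, matched: &str, dst: &mut String);

    /// Borrow the replacer instead of consuming it.
    fn by_ref(&mut self) -> ReplacerRef<'_, Self>
    where
        Self: Sized,
    {
        ReplacerRef(self)
    }
}

/// A dedicated wrapper type. Implementing `Replacer` for `&mut R` directly
/// would overlap with the closure impl below, since `&mut F` is itself
/// `FnMut` whenever `F` is.
struct ReplacerRef<'a, R: 'a>(&'a mut R);

impl<'a, R: Replacer> Replacer for ReplacerRef<'a, R> {
    fn replace_append(&mut self, matched: &str, dst: &mut String) {
        self.0.replace_append(matched, dst)
    }
}

/// Closures act as replacers.
impl<F: FnMut(&str) -> String> Replacer for F {
    fn replace_append(&mut self, matched: &str, dst: &mut String) {
        dst.push_str(&(*self)(matched));
    }
}
```

Calling `rep.by_ref()` twice on the same stateful closure keeps its state across both calls, which passing the closure by value would not allow.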
ethanpailes
7645ff2bc0 regex/literal: fix bug in Boyer-Moore
This patch fixes an issue where skip resolution would go straight
to the default value (the md2_shift) on a match failure after
the shift_loop. Now we do the right thing, and first check in
the skip table. The problem with going straight to the md2_shift
is that you can accidentally shift too far when `window_end`
actually is in the pattern (as is the case for the failing
match).
2018-03-07 15:33:29 -05:00
Andrew Gallant
43bb64b254
bench: small tweaks
This adds object files (produced by D compilers) to gitignore, and adds
RE2 to the benchmark compilation script by default.
2018-03-04 09:23:56 -05:00
Andrew Gallant
b0113ec3db
ci: remove doc generation
This has apparently been broken for a while, and with docs.rs, we don't
need it any more.

Tangentially, this method seemingly required a personal access token, which
seems like a bad idea in a shared repo.
2018-02-18 13:22:29 -05:00
Andrew Gallant
5eb4552262
ci: reformat 2018-02-18 13:21:52 -05:00
Andrew Gallant
f0b92ca277
bench: update to memmap 0.6 2018-02-17 22:14:47 -05:00
Andrew Gallant
9ee9943ec8
dfa: disable if size limit is 0
As a special case, if the user configures a DFA size limit of 0, then we
should never try to use it. This avoids a bit of thrashing where the DFA
tries to senselessly run before spilling over to the NFA.
2018-02-09 23:13:01 -05:00
Andrew Gallant
3182b23f34
0.2.6 2018-02-08 18:14:56 -05:00
Andrew Gallant
2dee2fe3f2
bench: add logs 2018-02-08 18:14:47 -05:00
Andrew Gallant
04355544f1
changelog: 0.2.6 2018-02-08 18:12:20 -05:00