This adds a printer for the high-level intermediate representation (HIR).
The pattern it prints is a valid regex, so printing an HIR value is one
way to turn it into a regex::Regex.
Previously, we had some inconsistencies in how we were handling ASCII
word boundaries. In particular, the translator was accepting a negated
ASCII word boundary even if the caller didn't disable the UTF-8 invariant.
This is wrong, since a negated ASCII word boundary can match between any
two arbitrary bytes. However, fixing this is a breaking change, so for
now we document the bug. We plan to fix it with regex 1.0. See #457.
Additionally, we were incorrectly declaring that an ASCII word boundary
matched invalid UTF-8 via the Hir::is_always_utf8 property. An ASCII word
boundary must always match an ASCII byte on one side, which implies a
valid UTF-8 position.
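
To make the argument above concrete, here is a minimal sketch of the two assertions over raw bytes. The function names are hypothetical and are not taken from the translator; the sketch only illustrates why the plain ASCII boundary always lands on a valid UTF-8 position while its negation does not.

```rust
/// True for bytes that `\w` matches in ASCII-only mode: [0-9A-Za-z_].
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Does an ASCII word boundary match at byte offset `at` (0 <= at <= len)?
/// It matches only when exactly one neighbor is an ASCII word byte, so in
/// valid UTF-8 the offset is necessarily a code point boundary.
pub fn is_ascii_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_ascii_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_ascii_word_byte(haystack[at]);
    before != after
}

/// The negation matches whenever both neighbors agree, which includes the
/// case where neither is ASCII, e.g. in the middle of a multi-byte code point.
pub fn is_not_ascii_word_boundary(haystack: &[u8], at: usize) -> bool {
    !is_ascii_word_boundary(haystack, at)
}
```

For the haystack `aé` (bytes `61 C3 A9`), the plain boundary matches at offset 1, which is a code point boundary, while the negated boundary matches at offset 2, which splits `é` in half.
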
This removes our compile time SIMD flags and replaces them with the
`unstable` feature, which will cause CI to use whatever CPU features are
available.
Ideally, we would test each important combination of CPU features, but I'd
like to avoid doing that in one CI job and instead split them out into
separate CI jobs to keep CI times low. That requires more work.
This commit adds a copy of the Teddy searcher that works on AVX2. We
don't attempt to reuse any code between them just yet, and instead just
copy & paste and tweak parts of it to work on 32 bytes instead of 16.
(Some parts were trickier than others. For example, @jneem figured out
how to nearly compensate for the lack of a real 256-bit bytewise PALIGNR
instruction, which we borrow here.)
Overall, AVX2 provides a nice bump in performance.
This commit ports the Teddy searcher to use std::arch and moves off the
portable SIMD vector API. Performance remains the same, and it looks
like the codegen is identical, which is great!
This also makes the `simd-accel` feature a no-op and adds a new
`unstable` feature which will enable the Teddy optimization. The `-C
target-feature` or `-C target-cpu` settings are no longer necessary,
since this will now do runtime target feature detection.
We also add a new `unstable` feature to the regex crate, which will
enable this new use of std::arch. Once enabled, the Teddy optimization
becomes available automatically without any additional compile time
flags.
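
As a rough illustration of runtime target feature detection, the sketch below picks a searcher variant using the standard library's `is_x86_feature_detected!` macro. The `TeddyVariant` enum and `choose_teddy_variant` function are hypothetical names, and checking `ssse3` for the 16-byte variant is an assumption made for this example rather than something stated above.

```rust
/// Hypothetical set of searcher variants; the names are illustrative only.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum TeddyVariant {
    /// 32-byte vectors.
    Avx2,
    /// 16-byte vectors (assumed here to require SSSE3).
    Ssse3,
    /// No vectorized searcher available.
    Fallback,
}

/// Ask the running CPU what it supports instead of relying on
/// `-C target-feature` or `-C target-cpu` at compile time.
pub fn choose_teddy_variant() -> TeddyVariant {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return TeddyVariant::Avx2;
        }
        if is_x86_feature_detected!("ssse3") {
            return TeddyVariant::Ssse3;
        }
    }
    // Non-x86 targets, or x86 CPUs without either feature, land here.
    TeddyVariant::Fallback
}
```

The result depends on the machine the code runs on, which is the point: the same binary can use AVX2 where it exists and fall back elsewhere.
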
This fixes a bug in the parser where a regex like `(?x)[ / - ]` would
fail to parse. In particular, since whitespace insensitive mode is
enabled, this regex should be equivalent to `[/-]`, where the `-` is
treated as a literal `-` instead of a range since it is the last
character in the class. However, the parser did not account for
whitespace insensitive mode, so it didn't see the `-` in `(?x)[ / - ]`
as trailing, and therefore reported an unclosed character class (since
the `]` was treated as part of the range).
We fix that in this commit by accounting for whitespace insensitive
mode, which we do by adding a `peek` method that skips over whitespace.
Fixes #455
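
The idea behind a whitespace-skipping `peek` can be sketched as follows. This is a simplified stand-in, not the parser's actual code: `peek_space` and `dash_is_trailing_literal` are hypothetical helpers that operate on a slice of chars, and the sketch ignores comments, which a full whitespace insensitive mode would also need to skip.

```rust
/// Return the character following `pos`. When `ignore_whitespace` is set
/// (as under the `x` flag), whitespace is skipped first.
pub fn peek_space(pattern: &[char], pos: usize, ignore_whitespace: bool) -> Option<char> {
    let mut i = pos + 1;
    if ignore_whitespace {
        while i < pattern.len() && pattern[i].is_whitespace() {
            i += 1;
        }
    }
    pattern.get(i).copied()
}

/// A `-` at `pos` is a trailing literal when the next significant
/// character closes the class.
pub fn dash_is_trailing_literal(pattern: &[char], pos: usize, ignore_whitespace: bool) -> bool {
    pattern[pos] == '-' && peek_space(pattern, pos, ignore_whitespace) == Some(']')
}
```

With the pattern body `[ / - ]`, the `-` sits at index 4. A plain peek sees the space at index 5 and misclassifies the `-` as the start of a range; the whitespace-aware peek sees `]` and treats it as a literal.
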
This commit improves the DFA's `follow_epsilons` routine slightly. In
particular, it eliminates a sizable chunk of stack operations by using
a normal linear loop. The only time we use the stack is for a Split
instruction, which is still admittedly quite common. However, as we improve
the byte code, many of the Split instructions should go away.
Note that this is the same technique used by the backtracking and PikeVM
engines.
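
A minimal sketch of the technique, using a toy instruction set rather than the crate's real program representation (the `Inst` enum and this `follow_epsilons` signature are assumptions for illustration): epsilon transitions are followed in a plain loop, and the stack is touched only when a Split forks into two branches.

```rust
/// Toy instruction set; only `Jump` and `Split` are epsilon transitions.
#[derive(Debug, Clone, Copy)]
pub enum Inst {
    Jump(usize),
    Split(usize, usize),
    Byte(u8),
    Match,
}

/// Return the non-epsilon instructions reachable from `start`.
pub fn follow_epsilons(prog: &[Inst], start: usize) -> Vec<usize> {
    let mut seen = vec![false; prog.len()];
    let mut out = Vec::new();
    let mut stack = vec![start];
    while let Some(mut ip) = stack.pop() {
        // Linear loop: keep following single successors without pushing.
        loop {
            if seen[ip] {
                break;
            }
            seen[ip] = true;
            match prog[ip] {
                Inst::Jump(next) => ip = next,
                Inst::Split(first, second) => {
                    // Only a fork needs the stack: remember the second
                    // branch and continue with the first one.
                    stack.push(second);
                    ip = first;
                }
                Inst::Byte(_) | Inst::Match => {
                    out.push(ip);
                    break;
                }
            }
        }
    }
    out
}
```

For a program shaped like `a*b` (`Split(1, 3)`, `Byte(a)`, `Jump(0)`, `Byte(b)`, `Match`), starting at instruction 0 or 2 yields instructions 1 and 3, with one stack push per Split.
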
This commit fixes an embarrassing bug where the depth in the nest limit
checker was never decremented during postorder traversal, which means
long but shallow regexes would incorrectly trip the nest limit. We fix
that in this commit and add two regression tests.
Fixes #454
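
The shape of the fix can be sketched with a toy AST. The `Ast` and `NestLimiter` types below are hypothetical, and the sketch uses plain recursion for brevity rather than a visitor; the part that matters is that every increment on the way down is paired with a decrement on the way back up.

```rust
/// Toy AST with two nesting constructs.
pub enum Ast {
    Literal(char),
    Group(Box<Ast>),
    Concat(Vec<Ast>),
}

pub struct NestLimiter {
    limit: usize,
    depth: usize,
}

impl NestLimiter {
    pub fn new(limit: usize) -> Self {
        NestLimiter { limit, depth: 0 }
    }

    /// Called before descending into a nested node.
    fn enter(&mut self) -> Result<(), String> {
        if self.depth >= self.limit {
            return Err(format!("nest limit {} exceeded", self.limit));
        }
        self.depth += 1;
        Ok(())
    }

    pub fn check(&mut self, ast: &Ast) -> Result<(), String> {
        match ast {
            Ast::Literal(_) => Ok(()),
            Ast::Group(inner) => {
                self.enter()?;
                let result = self.check(inner);
                // The decrement that was missing: leave the node again.
                self.depth -= 1;
                result
            }
            Ast::Concat(items) => {
                self.enter()?;
                let result = items.iter().try_for_each(|item| self.check(item));
                self.depth -= 1;
                result
            }
        }
    }
}
```

Without the two `self.depth -= 1` lines, a concatenation of one hundred sibling groups would count as depth 101 instead of depth 2.
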
If you just generate two random strings, the odds are very high
that the shorter one won't be a substring of the longer one once
they reach any substantial length. This means that the existing
quickcheck cases were probably just testing the negative cases.
The exception would be the two cases that append the needle
to the haystack, but those only test behavior at the ends. This
patch adds a better quickcheck case that can test a needle anywhere
in the haystack.
Fixes #446
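
A sketch of the property, written without the quickcheck crate so that it stands alone. Everything here is a stand-in: the small `Lcg` generator replaces quickcheck's input generation, and the naive `find` replaces the searcher under test. The property itself is the one described above: splice the needle into the haystack at an arbitrary position and require that the search finds it at or before that position.

```rust
/// Tiny linear congruential generator, used only to avoid external crates.
pub struct Lcg(u64);

impl Lcg {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }

    /// Random bytes over a small alphabet, so accidental matches do occur.
    pub fn bytes(&mut self, len: usize) -> Vec<u8> {
        (0..len).map(|_| b'a' + (self.next_u64() % 4) as u8).collect()
    }
}

/// Naive reference search standing in for the searcher under test.
pub fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Property: a needle inserted anywhere in the haystack is always found,
/// and the leftmost match is never later than the insertion point.
pub fn needle_anywhere_holds(seed: u64) -> bool {
    let mut rng = Lcg(seed);
    let hay_len = (rng.next_u64() % 64) as usize;
    let needle_len = 1 + (rng.next_u64() % 8) as usize;
    let mut haystack = rng.bytes(hay_len);
    let needle = rng.bytes(needle_len);
    let at = (rng.next_u64() % (haystack.len() as u64 + 1)) as usize;

    // Insert the needle at offset `at`.
    let tail = haystack.split_off(at);
    haystack.extend_from_slice(&needle);
    haystack.extend_from_slice(&tail);

    match find(&haystack, &needle) {
        Some(i) => i <= at && haystack[i..i + needle.len()] == needle[..],
        None => false,
    }
}
```

Unlike two independently generated strings, every generated case here is a positive one, and the match position varies across the whole haystack rather than only its ends.
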
This commit does the mechanical changes necessary to remove the old
regex-syntax crate and replace it with the rewrite. The rewrite now
subsumes the `regex-syntax` crate name, and gets a semver bump to 0.5.0.
This commit adds an explicit Debug impl for regex's main Error type.
The purpose of this impl is to format parse errors in normal panic
messages more nicely. This is slightly idiosyncratic, but since the default
Debug impl prints the full string anyway, we might as well format it
nicely.
See also: #450
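
As an illustration of what such an impl can look like, here is a sketch on a stand-in `Error` enum. The variants and the exact layout (a ruler of `~` characters around the message) are assumptions for this example and are not taken from the commit.

```rust
use std::fmt;

/// Stand-in error type; not the crate's actual definition.
pub enum Error {
    /// A multi-line, human-readable parse error.
    Syntax(String),
    /// The compiled program exceeded the given size limit.
    CompiledTooBig(usize),
}

impl fmt::Debug for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Syntax(msg) => {
                // Print the message verbatim on its own lines so that the
                // newlines in it survive `unwrap()` panics, instead of
                // being escaped as `\n` by a derived Debug impl.
                let ruler = "~".repeat(79);
                writeln!(f, "Syntax(")?;
                writeln!(f, "{}", ruler)?;
                writeln!(f, "{}", msg)?;
                writeln!(f, "{}", ruler)?;
                write!(f, ")")
            }
            Error::CompiledTooBig(limit) => {
                f.debug_tuple("CompiledTooBig").field(limit).finish()
            }
        }
    }
}
```

Because `Result::unwrap` formats the error with `{:?}`, a panic on a `Syntax` error then shows the parse error as written, including any line breaks it contains.
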
This commit provides exhaustive documentation for the regex crate's support
for Level 1 ("Basic Unicode Support") as documented in UTS#18.
We also document the small number of additions added to the concrete
syntax as a result of the regex-syntax rewrite.
See: http://unicode.org/reports/tr18/
With the regex syntax rewrite, we now support empty subexpressions more
officially. Unfortunately, the compiler has trouble with empty
subexpressions in alternation branches. There's no particular reason
not to support them, but they are difficult/awkward to express with the
current compiler. So just ban them for now.
If one does need an empty subexpression in an alternate branch, then
amusingly, something like `()?|z` will work. We could rewrite all such
empty sub-expressions into `()?`, which would retain the same match
semantics, but we choose to take the most conservative change possible.
This commit moves the entire regex crate over to the regex-syntax-2
rewrite. Most of this is just rewriting types.
The compiler got the most interesting set of changes. It got simpler
in some respects, but not significantly so.
This commit represents a ground up rewrite of the regex-syntax crate.
This commit is also an intermediate state. That is, it adds a new
regex-syntax-2 crate without making any serious changes to any other
code. Subsequent commits will cover the integration of the rewrite and
the removal of the old crate.
The rewrite is intended to be the first phase in an effort to overhaul
the entire regex crate. To that end, this rewrite takes steps in that
direction:
* The principal change in the public API is an explicit split between a
regular expression's abstract syntax (AST) and a high-level
intermediate representation (HIR) that is easier to analyze. The old
version of this crate mixed these two concepts, but leaned heavily
towards an HIR. The AST in the rewrite has a much closer
correspondence with the concrete syntax than the old `Expr` type does.
The new HIR embraces its role; all flags are now compiled away
(including the `i` flag), which will simplify subsequent passes,
including literal detection and the compiler. ASTs are produced by
ast::parse and HIR is produced by hir::translate. A top-level parser
is provided that combines these so that callers can skip straight from
concrete syntax to HIR.
* Error messages are vastly improved thanks to the span information that
is now embedded in the AST. In addition to better formatting, error
messages now also include helpful hints when trying to use features
that aren't supported (like backreferences and look-around). In
particular, octal support is now an opt-in option. (Octal support
will continue to be enabled in regex proper to support backwards
compatibility, but will be disabled in 1.0.)
* More robust support for Unicode Level 1 as described in UTS#18.
In particular, we now fully support Unicode character classes
including set notation (difference, intersection, symmetric
difference) and correct support for named general categories, scripts,
script extensions and age. That is, `\p{scx:Hira}` and `\p{age:3.0}`
now work. To make this work, we introduce an internal interval set
data structure.
* With the exception of literal extraction (which will be overhauled in
a later phase), all code in the rewrite uses constant stack space,
even while performing analysis that requires structural induction over
the AST or HIR. This is done by pushing the call stack onto the heap,
and is abstracted by the `ast::Visitor` and `hir::Visitor` traits.
The point of this method is to eliminate stack overflows in the
general case.
* Empty sub-expressions are now properly supported. Expressions like
`()`, `|`, `a|` and `b|()+` are now valid syntax.
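
The constant stack space point in the list above can be sketched as follows. This is not the `ast::Visitor` or `hir::Visitor` trait, only a minimal example of the underlying idea on a toy `Hir` type: the pending work lives in a heap-allocated `Vec` rather than in recursive call frames.

```rust
/// Toy expression tree; not the crate's actual HIR.
pub enum Hir {
    Literal(char),
    Group(Box<Hir>),
    Concat(Vec<Hir>),
}

/// Compute the maximum nesting depth without recursion. The explicit
/// stack grows on the heap, so the call stack usage stays constant no
/// matter how deeply the expression is nested.
pub fn max_depth(root: &Hir) -> usize {
    let mut stack: Vec<(&Hir, usize)> = vec![(root, 1)];
    let mut max = 0;
    while let Some((node, depth)) = stack.pop() {
        max = max.max(depth);
        match node {
            Hir::Literal(_) => {}
            Hir::Group(inner) => stack.push((&**inner, depth + 1)),
            Hir::Concat(items) => {
                for item in items {
                    stack.push((item, depth + 1));
                }
            }
        }
    }
    max
}
```

Note that this toy type still relies on the compiler-generated recursive `Drop`, so a complete solution would also need to tear the tree down iteratively.
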
The principal downsides of these changes are parse time and binary size.
Both seem to have increased (slower and bigger) by about 1.5x. Parse
time is generally peanuts compared to the compiler, so we mostly don't
care about that. Binary size is mildly unfortunate, and if it becomes a
serious issue, it should be possible to introduce a feature that
disables some level of Unicode support and/or work on compressing the
Unicode tables. Compile times have increased slightly, but are still a
very small fraction of the overall time it takes to compile `regex`.
Fixes #174, Fixes #424
This permits use of a Replacer without consuming it.
Note: This can't simply return `&mut Self` because a generic
`impl<R: Replacer> Replacer for &mut R` would conflict with libstd's
generic `impl<F: FnMut> FnMut for &mut F`.
See also: #83
Closes #449
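
A self-contained sketch of the wrapper approach that the note above motivates. The `Replacer` trait here is a simplified stand-in that works on `&str` instead of captures, and `replace_all` is a hypothetical consumer; the point is that `by_ref` returns a local wrapper type, which sidesteps the conflict with the closure impl.

```rust
/// Simplified stand-in for the real trait.
pub trait Replacer {
    fn replace_append(&mut self, matched: &str, dst: &mut String);

    /// Borrow this replacer so it can be passed by value more than once.
    fn by_ref(&mut self) -> ReplacerRef<'_, Self>
    where
        Self: Sized,
    {
        ReplacerRef(self)
    }
}

/// Local wrapper around `&mut R`. Being a distinct type, its impl cannot
/// overlap with the blanket impl for closures below.
pub struct ReplacerRef<'a, R: ?Sized>(&'a mut R);

impl<'a, R: Replacer + ?Sized> Replacer for ReplacerRef<'a, R> {
    fn replace_append(&mut self, matched: &str, dst: &mut String) {
        self.0.replace_append(matched, dst)
    }
}

/// Closures are replacers.
impl<F> Replacer for F
where
    F: FnMut(&str) -> String,
{
    fn replace_append(&mut self, matched: &str, dst: &mut String) {
        dst.push_str(&(*self)(matched));
    }
}

/// Hypothetical consumer that takes its replacer by value.
pub fn replace_all<R: Replacer>(text: &str, needle: &str, mut rep: R) -> String {
    if needle.is_empty() {
        return text.to_string();
    }
    let mut out = String::new();
    let mut rest = text;
    while let Some(i) = rest.find(needle) {
        out.push_str(&rest[..i]);
        rep.replace_append(needle, &mut out);
        rest = &rest[i + needle.len()..];
    }
    out.push_str(rest);
    out
}
```

A stateful closure can then be reused across calls, for example `replace_all(a, "x", rep.by_ref())` followed by `replace_all(b, "x", rep.by_ref())`, with its captured state carried from one call to the next.
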
This patch fixes an issue where skip resolution would go straight
to the default value (the md2_shift) on a match failure after
the shift_loop. Now we do the right thing, and first check in
the skip table. The problem with going straight to the md2_shift
is that you can accidentally shift too far when `window_end`
actually is in the pattern (as is the case for the failing
match).
This has apparently been broken for a while, and with docs.rs, we don't
need it any more.
Tangentially, this method seemingly required a personal access token, which
seems like a bad idea in a shared repo.
As a special case, if the user configures a DFA size limit of 0, then we
should never try to use it. This avoids a bit of thrashing where the DFA
tries to senselessly run before spilling over to the NFA.