To avoid this assertion in tests when empty alternations are allowed:
    internal error: entered unreachable code: expected literal or
    concat, got Hir { kind: Empty, info: HirInfo { bools: 1795 } }',
    src/exec.rs:1568:18
The code in exec.rs relies on the documented invariant for
is_alternation_literal:
    /// ... This is only true when this HIR expression is either
    /// itself a `Literal` or a concatenation of only `Literal`s or an
    /// alternation of only `Literal`s.
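The documented invariant can be read as a simple predicate. The sketch
below uses a hypothetical, pared-down `Hir` enum (not the real
regex-syntax type, which has more variants and precomputed attributes)
to show why an `Empty` branch inside an alternation must make the
predicate false. Rejecting empty sequences is an extra assumption of
this sketch.

```rust
/// Hypothetical, pared-down HIR used only for illustration. The real
/// `Hir` type in regex-syntax is considerably richer.
#[derive(Debug)]
enum Hir {
    Empty,
    Literal(char),
    Concat(Vec<Hir>),
    Alternation(Vec<Hir>),
}

impl Hir {
    /// True only when this expression is a single literal.
    fn is_literal(&self) -> bool {
        matches!(self, Hir::Literal(_))
    }

    /// True only for a literal, a concatenation of only literals, or an
    /// alternation of only literals, following the quoted doc comment.
    /// Empty sequences are rejected here (an assumption of this sketch).
    fn is_alternation_literal(&self) -> bool {
        match self {
            Hir::Empty => false,
            Hir::Literal(_) => true,
            Hir::Concat(xs) | Hir::Alternation(xs) => {
                !xs.is_empty() && xs.iter().all(Hir::is_literal)
            }
        }
    }
}
```

With this definition, an alternation such as `a|` (a literal and an
empty branch) is not an alternation literal, so code that expects only
literals or concatenations never sees `Empty`.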
This mirrors the same routine on ClassBytes. This is useful when
translating an HIR to an NFA and one wants to write a fast path for the
common all-ASCII case.
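A minimal sketch of such a check, assuming a hypothetical
`UnicodeClass` type that stores its ranges sorted and non-overlapping
(the type and field names here are illustrative, not the library's
API):

```rust
/// Hypothetical stand-in for a Unicode character class: inclusive
/// ranges of `char`, assumed sorted by start and non-overlapping.
struct UnicodeClass {
    ranges: Vec<(char, char)>,
}

impl UnicodeClass {
    /// Because the ranges are sorted, only the end of the last range
    /// needs to be checked. An empty class is trivially all ASCII.
    fn is_all_ascii(&self) -> bool {
        self.ranges.last().map_or(true, |&(_, end)| end <= '\x7F')
    }
}
```

A translator could call `is_all_ascii` once per class and, when it
returns true, emit byte ranges directly instead of building UTF-8
sequences.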
This fixes a rather nasty bug where flags set inside a group were being
applied to expressions outside the group. For example, in the simplest
case, `((?i)a)b` would match `aB`, even though the case insensitive flag
_shouldn't_ be applied to `b`.
The issue here was that we were actually going out of our way to reset
the flags when a group is popped only _some_ of the time. Namely, when
flags were set via `(?i:a)b` syntax. Instead, flags should be reset to
their previous state _every_ time a group is popped in the translator.
The fix here is pretty simple. When we open a group, if the group itself
does not have any flags, then we simply record the current state of the
flags instead of trying to replace the current flags. Then, when we pop
the group, we are guaranteed to obtain the old flags, at which point, we
reset them.
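The save-and-restore scheme described above can be sketched as follows.
This is a simplified illustration, not the translator's actual code:
`Flags`, `Translator`, `push_group`, `set_flags`, `pop_group` and
`walk_example` are hypothetical names, and only the case insensitive
flag is modeled.

```rust
/// Hypothetical, simplified flag state (only `i` is modeled).
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct Flags {
    case_insensitive: bool,
}

/// Hypothetical translator that tracks nothing but flags.
#[derive(Default)]
struct Translator {
    flags: Flags,
    saved: Vec<Flags>,
}

impl Translator {
    /// Open a group. The flags in effect before the group are always
    /// saved. If the group carries its own flags, as in `(?i:a)`, they
    /// replace the current flags for the body of the group.
    fn push_group(&mut self, group_flags: Option<Flags>) {
        self.saved.push(self.flags);
        if let Some(f) = group_flags {
            self.flags = f;
        }
    }

    /// Apply an inline directive such as `(?i)`.
    fn set_flags(&mut self, f: Flags) {
        self.flags = f;
    }

    /// Close a group and restore the flags saved when it was opened.
    /// This happens every time, whether or not the group had flags.
    fn pop_group(&mut self) {
        if let Some(old) = self.saved.pop() {
            self.flags = old;
        }
    }
}

/// Walk the shape of `((?i)a)b` and report whether `a` and `b` would
/// be translated case insensitively.
fn walk_example() -> (bool, bool) {
    let mut t = Translator::default();
    t.push_group(None); // `(`
    t.set_flags(Flags { case_insensitive: true }); // `(?i)`
    let a = t.flags.case_insensitive; // `a`
    t.pop_group(); // `)`
    let b = t.flags.case_insensitive; // `b`
    (a, b)
}
```

In this sketch `walk_example` reports that `a` is case insensitive and
`b` is not, which is the behavior the fix is meant to guarantee.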
Fixes #640
PR #633 removed these methods, but we can't do that without making a
breaking change release. Removing deprecated methods isn't worth doing a
breaking change release, so we instead simply allow them for now by
squashing the warnings.
Closes #633
This commit refactors the way this library handles Unicode data by
making it completely optional. Several features are introduced which
permit callers to select only the Unicode data they need (up to a point
of granularity).
An important property of these changes is that the presence or absence of
crate features will never change the match semantics of a regular
expression. Instead, the presence or absence of a crate feature can only
add or subtract from the set of all possible valid regular expressions.
So for example, if the `unicode-case` feature is disabled, then
attempting to produce `Hir` for the regex `(?i)a` will fail. Instead,
callers must use `(?i-u)a` (or enable the `unicode-case` feature).
This partially addresses #583 since it permits callers to decrease
binary size.
This nominally moves the logic for acquiring Unicode-aware Perl character
classes into the `unicode` module, and also makes the calling code
robust with respect to failures.
This commit is prep work for making the availability of Unicode-aware
Perl classes optional.
This one was a bit hard to swallow because it involved copying a
fairly short but not terribly simple function for normalizing property
names/values. But the code is small, changes rarely, and is easily
tested, so it's just not worth bringing in a whole dependency for it
given how big regex-syntax already is.
This commit brings the utf8-ranges crate into regex-syntax as a utf8
sub-module.
This was done because it was observed that utf8-ranges is effectively
unused outside the context of regex-syntax. It is a very small amount of
code, and fits alongside the rest of regex-syntax. In particular, anyone
building a regex engine using regex-syntax will likely need this code
anyway.
Rust 1.28 is almost a year old by this point, and there were a number of
nice stabilizations between 1.24 and 1.28. Notably, vendor intrinsics were
stabilized in Rust 1.26, so we no longer need a build script.
The problem with putting it in the regex crate proper is that it
requires the regex crate to bump its minimal regex-syntax crate version.
While this isn't necessarily an issue, since we can't enable Cargo's
minimal version check because of the `rand` dependency, this winds up
being a hazard. Plus, having it in the regex crate doesn't buy us too
much. It's just as well to have the tests in regex-syntax.
Fixes #593
This fixes yet another bug with our handling of (?flags) directives in
the regex. This time, we try to be a bit more principled and
specifically treat a (?flags) directive as a valid empty sub-expression.
While this means we could remove errors reported from previous fixes for
things like `(?i)+`, we retain those for now since they are a bit weird,
even though the equivalent `((?i))+` is now allowed. We should
probably allow `(?i)+` in the future for consistency's sake.
Fixes #527
This fixes a bug where the HIR translator would panic on regexes such as
`(?i){1}` since it assumes that every repetition operator has a valid
sub-expression, and `(?i)` is not actually a sub-expression (but is more
like a directive instead).
Previously, we fixed this same bug for *uncounted* repetitions in commit
17764ffe (for bug #465), but we did not fix it for counted repetitions.
We apply the same fix here.
Fixes #555
This adds a couple new methods on HIR expressions for determining whether
they are literals or not. This is useful for determining whether to apply
optimizations such as Aho-Corasick without re-analyzing the syntax.
This commit adds two new predicates to `Hir` values that permit querying
whether an expression is *line* anchored at the start or end.
This was motivated by a desire to tweak the offsets of a match when
enabling --crlf mode in ripgrep.
This commit adds several emoji properties such as Emoji and
Extended_Pictographic. We also add support for the Grapheme_Cluster_Break,
Word_Break and Sentence_Break enumeration properties.