This fixes a bug in how ASCII class unioning was implemented. Namely, it
previously and erroneously unioned together two classes and then applied
negation/case-folding based on the most recently added class, even if
the class added previously wasn't negated. So for example, given the
regex '[[:alnum:][:^ascii:]]', this would initialize the class with
'[:alnum:]', then add all '[:^ascii:]' codepoints and then negate the
entire thing because of the negation in '[:^ascii:]'. Negating the
entire thing is clearly wrong and not the intended semantics.
We fix this by applying negation/case-folding only to the class we're
dealing with, and then we union it with whatever existing class we're
building.
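
The difference between the two orderings can be sketched with a toy
model. This is only an illustration, assuming classes are plain sets of
byte values; `Class`, `alnum`, `ascii`, `negate`, `union_buggy` and
`union_fixed` are hypothetical names, not the ones used in regex-syntax.

```rust
use std::collections::BTreeSet;

/// Toy class: the set of byte values it matches (an assumption made for
/// this sketch; the real translator works on ranges of codepoints).
type Class = BTreeSet<u8>;

/// Hypothetical helper: the members of `[:alnum:]`.
fn alnum() -> Class {
    (0u8..=127).filter(|b| b.is_ascii_alphanumeric()).collect()
}

/// Hypothetical helper: the members of `[:ascii:]`.
fn ascii() -> Class {
    (0u8..=127).collect()
}

/// Negate a class relative to the full byte range.
fn negate(class: &Class) -> Class {
    (0u8..=255).filter(|b| !class.contains(b)).collect()
}

/// Old behavior: union first, then negate the whole accumulated class
/// because the most recently added item was negated.
fn union_buggy(acc: &Class, item: &Class, item_negated: bool) -> Class {
    let merged: Class = acc.union(item).copied().collect();
    if item_negated { negate(&merged) } else { merged }
}

/// New behavior: negate only the item being added, then union it with
/// whatever has been built so far.
fn union_fixed(acc: &Class, item: &Class, item_negated: bool) -> Class {
    let item = if item_negated { negate(item) } else { item.clone() };
    acc.union(&item).copied().collect()
}
```

In this model, `union_buggy(&alnum(), &ascii(), true)` no longer contains
`a`, while `union_fixed` keeps every alphanumeric byte and adds the
non-ASCII ones.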
Fixes #680
When only the unicode-perl feature is enabled, regex-syntax would fail
to build. It turns out that 'cargo fix' doesn't actually fix all
imports. It looks like it only fixes things that it can build in the
current configuration.
Fixes #769, Fixes #770
This commit does a number of manual fixups to the code after the
previous two commits were done via 'cargo fix' automatically.
Actually, this contains more 'cargo fix' annotations, since I had
forgotten to add 'edition = "2018"' to all sub-crates.
Previously, the translator would forbid constructs like [^\w\W] that
compiled to empty character classes. These things are forbidden not
because the translator can't handle them, but because the compiler in
'regex' proper can't. Once we migrate to the compiler in
regex-automata, which supports empty classes, then we can lift this
restriction. But until then, we should ban all such instances. It turns
out that \P{any} was another way to utter this, so we ban it in this
commit.
This was found by OSS-Fuzz:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=26505
Fixes #722
It turns out that 'cf' is also an abbreviation for the 'Case_Folding'
property. Even though we don't actually support a 'Case_Folding'
property, a quirk of our code caused 'cf' to fail since it was treated
as a normal boolean property instead of a general category. We fix it by
special casing it.
Note that '\p{gc=cf}' worked and continues to work.
If we ever do add the 'Case_Folding' property, we'll not be able to
support its abbreviation since it is now taken by 'Format'.
Fixes #719
This slightly expands the set of characters allowed in capture group
names to be `[][_0-9A-Za-z.]` from `[_0-9A-Za-z]`.
This required some delicacy in order to avoid replacement strings like
`$Z[` from referring to invalid capture group names where the intent was
to refer to the capture group named `Z`. That is, in order to use `[`,
`]` or `.` in a capture group name, one must use the explicit brace
syntax: `${Z[}`. We clarify the docs around this issue.
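
A minimal sketch of the distinction between the two reference forms,
assuming a simplified parser over `&str` (the paragraph below is about
`&[u8]`). `parse_capture_ref` and `is_unbraced_name_byte` are
hypothetical names for this illustration and do not reflect the actual
replacement parser.

```rust
/// Bytes allowed in an unbraced reference such as `$name`. The new
/// characters `[`, `]` and `.` are deliberately excluded here.
fn is_unbraced_name_byte(b: u8) -> bool {
    b == b'_' || b.is_ascii_alphanumeric()
}

/// Parse a capture reference at the start of `rep`. Returns the
/// referenced name and the number of bytes consumed, or `None` if `rep`
/// does not start with a reference.
fn parse_capture_ref(rep: &str) -> Option<(&str, usize)> {
    let rest = rep.strip_prefix('$')?;
    if let Some(braced) = rest.strip_prefix('{') {
        // Braced form: in this toy, everything up to `}` is the name.
        let end = braced.find('}')?;
        if end == 0 {
            return None;
        }
        // Consumed bytes: `$`, `{`, the name, and `}`.
        return Some((&braced[..end], end + 3));
    }
    // Unbraced form: take the longest run of the restricted byte set.
    let len = rest.bytes().take_while(|&b| is_unbraced_name_byte(b)).count();
    if len == 0 {
        return None;
    }
    Some((&rest[..len], len + 1))
}
```

With this sketch, `$Z[` yields the name `Z` and leaves `[` as literal
text, while `${Z[}` yields the name `Z[`.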
Regrettably, we are not much closer to handling #595. In order to
support, say, all Unicode word characters, our replacement parser would
need to become UTF-8 aware on `&[u8]`. But std makes this difficult and
I would prefer not to add another dependency on ad hoc UTF-8 decoding or
a dependency on another crate.
Closes #649
To avoid this assertion in tests when empty alternations are allowed:
internal error: entered unreachable code: expected literal or
concat, got Hir { kind: Empty, info: HirInfo { bools: 1795 } }',
src/exec.rs:1568:18
The code in exec.rs relies on the documented invariant for
is_alternation_literal:
/// ... This is only true when this HIR expression is either
/// itself a `Literal` or a concatenation of only `Literal`s or an
/// alternation of only `Literal`s.
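
To illustrate why an empty alternation branch must make this predicate
false, here is a toy sketch. The `Hir` enum below is a hypothetical
stand-in, not the real type, and it assumes that each alternation branch
may be a literal or a concatenation of literals, as the panic message
above suggests.

```rust
/// Toy HIR for this sketch only; the real `Hir` carries much more.
#[derive(Debug)]
enum Hir {
    Empty,
    Literal(char),
    Concat(Vec<Hir>),
    Alternation(Vec<Hir>),
}

/// True for a literal or a concatenation made only of literals.
fn is_literal_or_concat(hir: &Hir) -> bool {
    match hir {
        Hir::Literal(_) => true,
        Hir::Concat(parts) => {
            parts.iter().all(|p| matches!(p, Hir::Literal(_)))
        }
        _ => false,
    }
}

/// True only when every alternation branch is a literal or a
/// concatenation of literals. An `Empty` branch, as in `a|`, makes this
/// false, so callers relying on the invariant never see `Empty`.
fn is_alternation_literal(hir: &Hir) -> bool {
    match hir {
        Hir::Alternation(branches) => {
            branches.iter().all(is_literal_or_concat)
        }
        other => is_literal_or_concat(other),
    }
}
```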
This mirrors the same routine on ClassBytes. This is useful when
translating an HIR to an NFA and one wants to write a fast path for the
common all ASCII case.
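
A minimal sketch of such a check, assuming a class stored as sorted,
non-overlapping codepoint ranges. `ToyRange`, `ToyClass` and the method
name `is_all_ascii` are hypothetical names chosen for this illustration.

```rust
/// Toy inclusive codepoint range (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug)]
struct ToyRange {
    start: char,
    end: char,
}

/// Toy class. Assumption: `ranges` is sorted by `start` and the ranges
/// do not overlap.
struct ToyClass {
    ranges: Vec<ToyRange>,
}

impl ToyClass {
    /// Because the ranges are sorted, only the last range needs to be
    /// inspected. An empty class is trivially all ASCII.
    fn is_all_ascii(&self) -> bool {
        self.ranges.last().map_or(true, |r| r.end <= '\x7F')
    }
}
```

A caller building an NFA could branch on this once per class and emit
byte-oriented transitions directly when it returns true.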
This fixes a rather nasty bug where flags set inside a group were being
applied to expressions outside the group. For example, in the simplest
case, `((?i)a)b` would match `aB`, even though the case insensitive flag
_shouldn't_ be applied to `b`.
The issue here was that we were actually going out of our way to reset
the flags when a group is popped only _some_ of the time, namely when
flags were set via `(?i:a)b` syntax. Instead, flags should be reset to
their previous state _every_ time a group is popped in the translator.
The fix here is pretty simple. When we open a group, if the group itself
does not have any flags, then we simply record the current state of the
flags instead of trying to replace the current flags. Then, when we pop
the group, we are guaranteed to obtain the old flags, at which point, we
reset them.
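
The save-and-restore scheme can be sketched as follows. This is a toy
model with a single flag; `Flags` and `Translator` here are hypothetical
simplifications, not the real translator types.

```rust
/// Toy flag set with a single flag (an assumption for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Flags {
    case_insensitive: bool,
}

/// Toy translator state: the active flags plus one saved entry per
/// currently open group.
struct Translator {
    flags: Flags,
    saved: Vec<Flags>,
}

impl Translator {
    fn new() -> Self {
        Translator {
            flags: Flags { case_insensitive: false },
            saved: Vec::new(),
        }
    }

    /// Opening a group always records the current flags. If the group
    /// carries its own flags, as in `(?i:a)`, they become active.
    fn open_group(&mut self, group_flags: Option<Flags>) {
        self.saved.push(self.flags);
        if let Some(flags) = group_flags {
            self.flags = flags;
        }
    }

    /// An inline directive such as `(?i)` changes the active flags.
    fn set_flags(&mut self, flags: Flags) {
        self.flags = flags;
    }

    /// Closing a group always restores the flags recorded at open time.
    fn close_group(&mut self) {
        if let Some(old) = self.saved.pop() {
            self.flags = old;
        }
    }
}
```

Walking `((?i)a)b` through this model: `open_group(None)`, then
`set_flags` for `(?i)`, then `close_group()`, after which the
case insensitive flag is off again when `b` is translated.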
Fixes #640
PR #633 removed these methods, but we can't do that without making a
breaking change release. Removing deprecated methods isn't worth doing a
breaking change release, so we instead simply allow them for now by
squashing the warnings.
Closes #633
This commit refactors the way this library handles Unicode data by
making it completely optional. Several features are introduced which
permit callers to select only the Unicode data they need (up to a point
of granularity).
An important property of these changes is that presence or absence of
crate features will never change the match semantics of a regular
expression. Instead, the presence or absence of a crate feature can only
add or subtract from the set of all possible valid regular expressions.
So for example, if the `unicode-case` feature is disabled, then
attempting to produce `Hir` for the regex `(?i)a` will fail. Instead,
callers must use `(?i-u)a` (or enable the `unicode-case` feature).
This partially addresses #583 since it permits callers to decrease
binary size.
This nominally moves the logic for acquiring Unicode-aware Perl character
classes into the `unicode` module, and also makes the calling code
robust with respect to failures.
This commit is prep work for making the availability of Unicode-aware
Perl classes optional.