Some lints have been intentionally ignored, especially:
* any lints that would change public APIs (like &self -> self)
* any lints that would introduce new public APIs (like Default over new)
Previously, 'ab??' returned [Complete(ab), Complete(a)], but the order
matters here because of greediness. The correct result is [Complete(a),
Complete(ab)].
Instead of trying to actually fix literal extraction (which is a mess),
we just rewrite 'ab?' (and 'ab??') as 'ab*'. 'ab*' still produces
literals in the incorrect order, i.e., [Cut(ab), Complete(a)], but since
one is cut we are guaranteed that the regex engine will be called to
confirm the match. In so doing, it will correctly report 'a' as a match
for 'ab??' in 'ab'.
Fixes #862
This was incorrectly defined for \b. Previously, I had erroneously made
it return true only for \B since \B matches '' and \b does not match
''. However, \b does match the empty string. Like \B, it only matches a
subset of empty strings, depending on what the surrounding context is.
The important bit is that it can match *an* empty string, not that it
matches *the* empty string.
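To make the distinction concrete, here is a minimal sketch of an ASCII-only word boundary check. It is not the code in regex-syntax; the names `is_word_byte`, `is_word_boundary`, `boundary_offsets` and `non_boundary_offsets` are hypothetical, and Unicode word characters are ignored for brevity. Every offset it reports is a zero-width (empty) match.

```rust
/// Returns true if `b` is an ASCII word byte, i.e., `[0-9A-Za-z_]`.
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Returns true if the empty string at offset `at` sits on an ASCII word
/// boundary. Assumes `at <= haystack.len()`.
fn is_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_word_byte(haystack[at]);
    before != after
}

/// Offsets at which \b would match. Each one is an empty match.
fn boundary_offsets(haystack: &[u8]) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&at| is_word_boundary(haystack, at))
        .collect()
}

/// Offsets at which \B would match. Each one is also an empty match.
fn non_boundary_offsets(haystack: &[u8]) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&at| !is_word_boundary(haystack, at))
        .collect()
}
```

In the haystack 'ab', \b matches the empty strings at offsets 0 and 2 and \B matches the one at offset 1. In the empty haystack, only \B matches. So both assertions can match an empty string; they just accept different ones.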
We were not yet using this predicate anywhere in the regex crate, so we
just fix the implementation and update the tests.
This does present a compatibility hazard for anyone who was using this
function, but as of this time, I'm considering this a bug fix since \b
clearly matches an empty string.
Fixes #859
This fixes a bug in how ASCII class unioning was implemented. Namely, it
previously and erroneously unioned together two classes and then applied
negation/case-folding based on the most recently added class, even if
the class added previously wasn't negated. So for example, given the
regex '[[:alnum:][:^ascii:]]', this would initialize the class with
'[:alnum:]', then add all '[:^ascii:]' codepoints and then negate the
entire thing because of the negation in '[:^ascii:]'. Negating the
entire thing is clearly wrong and not the intended semantics.
We fix this by applying negation/case-folding only to the class we're
dealing with, and then we union it with whatever existing class we're
building.
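The difference between the two strategies can be sketched with a toy model. This is not the translator's code: classes are modeled as sets of bytes in 0..=255 standing in for codepoints, case folding is left out, and `union_old` is only our reading of the old behavior described above. All names are hypothetical.

```rust
use std::collections::BTreeSet;

/// A toy class: a set of "codepoints" restricted to 0..=255.
type Class = BTreeSet<u8>;

/// Complement of `class` within the toy universe 0..=255.
fn negate(class: &Class) -> Class {
    (0..=u8::MAX).filter(|b| !class.contains(b)).collect()
}

/// The base set for [:alnum:].
fn alnum() -> Class {
    (0..=u8::MAX).filter(|b| b.is_ascii_alphanumeric()).collect()
}

/// The base set for [:ascii:].
fn ascii() -> Class {
    (0..=0x7F).collect()
}

/// Old behavior (as assumed here): add each item's base set to the
/// accumulator, then negate the *entire* accumulator if the item was negated.
fn union_old(items: &[(Class, bool)]) -> Class {
    let mut acc = Class::new();
    for (base, negated) in items {
        acc.extend(base.iter().copied());
        if *negated {
            acc = negate(&acc);
        }
    }
    acc
}

/// Fixed behavior: negate only the item being added, then union it in.
fn union_fixed(items: &[(Class, bool)]) -> Class {
    let mut acc = Class::new();
    for (base, negated) in items {
        let piece = if *negated { negate(base) } else { base.clone() };
        acc.extend(piece);
    }
    acc
}
```

For '[[:alnum:][:^ascii:]]', modeled as `[(alnum(), false), (ascii(), true)]`, `union_fixed` keeps 'a' and adds everything at or above 0x80, while `union_old` loses 'a' because the negation is applied to the whole accumulated class.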
Fixes #680
When only the unicode-perl feature is enabled, regex-syntax would fail
to build. It turns out that 'cargo fix' doesn't actually fix all
imports. It looks like it only fixes things that it can build in the
current configuration.
Fixes #769, Fixes #770
This commit does a number of manual fixups to the code after the
previous two commits were done via 'cargo fix' automatically.
Actually, this contains more 'cargo fix' annotations, since I had
forgotten to add 'edition = "2018"' to all sub-crates.
Previously, the translator would forbid constructs like [^\w\W] that
compiled to empty character classes. These things are forbidden not
because the translator can't handle them, but because the compiler in
'regex' proper can't handle them. Once we migrate to the compiler in
regex-automata, which supports empty classes, then we can lift this
restriction. But until then, we should ban all such instances. It turns
out that \P{any} was another way to utter this, so we ban it in this
commit.
This was found by OSS-Fuzz:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=26505
Fixes #722
It turns out that 'cf' is also an abbreviation for the 'Case_Folding'
property. Even though we don't actually support a 'Case_Folding'
property, a quirk of our code caused 'cf' to fail since it was treated
as a normal boolean property instead of a general category. We fix it by
special casing it.
Note that '\p{gc=cf}' worked and continues to work.
If we ever do add the 'Case_Folding' property, we'll not be able to
support its abbreviation since it is now taken by 'Format'.
Fixes#719
This slightly expands the set of characters allowed in capture group
names to be `[][_0-9A-Za-z.]` from `[_0-9A-Za-z]`.
This required some delicacy in order to avoid replacement strings like
`$Z[` from referring to invalid capture group names where the intent was
to refer to the capture group named `Z`. That is, in order to use `[`,
`]` or `.` in a capture group name, one must use the explicit brace
syntax: `${Z[}`. We clarify the docs around this issue.
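As a rough illustration of the rule, here is a simplified sketch of parsing a single capture reference at the start of a replacement string. It is not the crate's replacement parser: `parse_capture_ref` and the two predicates are hypothetical names, and details such as `$$` escapes and numeric group indices are left out.

```rust
/// Bytes allowed in an unbraced name such as `$name`: `[_0-9A-Za-z]`.
fn is_unbraced_name_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Bytes allowed in a braced name such as `${name}`: `[][_0-9A-Za-z.]`.
fn is_braced_name_byte(b: u8) -> bool {
    is_unbraced_name_byte(b) || matches!(b, b'[' | b']' | b'.')
}

/// Parses a capture reference at the start of `rep`. On success, returns
/// the group name and the number of bytes the reference occupies.
fn parse_capture_ref(rep: &str) -> Option<(&str, usize)> {
    let bytes = rep.as_bytes();
    if bytes.first() != Some(&b'$') {
        return None;
    }
    if bytes.get(1) == Some(&b'{') {
        // Braced form: everything up to the first `}` is the name.
        let close = rep[2..].find('}')? + 2;
        let name = &rep[2..close];
        if name.is_empty() || !name.bytes().all(is_braced_name_byte) {
            return None;
        }
        return Some((name, close + 1));
    }
    // Unbraced form: take the longest run of the narrower byte set, so
    // `[`, `]` and `.` always end the name.
    let len = bytes[1..]
        .iter()
        .take_while(|&&b| is_unbraced_name_byte(b))
        .count();
    if len == 0 {
        return None;
    }
    Some((&rep[1..1 + len], 1 + len))
}
```

With this sketch, `$Z[` yields the name `Z` and leaves `[` as literal text, while `${Z[}` yields the name `Z[`.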
Regrettably, we are not much closer to handling #595. In order to
support, say, all Unicode word characters, our replacement parser would
need to become UTF-8 aware on `&[u8]`. But std makes this difficult and
I would prefer not to add another dependency on ad hoc UTF-8 decoding or
a dependency on another crate.
Closes #649
To avoid this assertion in tests when empty alternations are allowed:
    internal error: entered unreachable code: expected literal or
    concat, got Hir { kind: Empty, info: HirInfo { bools: 1795 } }',
    src/exec.rs:1568:18
The code in exec.rs relies on the documented invariant for
is_alternation_literal:
    /// ... This is only true when this HIR expression is either
    /// itself a `Literal` or a concatenation of only `Literal`s or an
    /// alternation of only `Literal`s.
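A toy model of the invariant may help show why an empty alternation branch must clear the flag. The `Hir` enum below is a hypothetical stand-in, not the real regex-syntax type, and treating a branch that is a concatenation of literals as acceptable is an assumption based on the "expected literal or concat" message above.

```rust
/// A hypothetical, much reduced HIR.
enum Hir {
    Empty,
    Literal(char),
    Concat(Vec<Hir>),
    Alternation(Vec<Hir>),
}

/// True for a `Literal` or a concatenation of only `Literal`s.
fn is_literal_or_literal_concat(hir: &Hir) -> bool {
    match hir {
        Hir::Literal(_) => true,
        Hir::Concat(parts) => parts.iter().all(|p| matches!(p, Hir::Literal(_))),
        _ => false,
    }
}

/// True only when every alternation branch is a literal or a concatenation
/// of literals. An `Empty` branch, as in 'a|', makes this false.
fn is_alternation_literal(hir: &Hir) -> bool {
    match hir {
        Hir::Alternation(branches) => branches.iter().all(is_literal_or_literal_concat),
        other => is_literal_or_literal_concat(other),
    }
}
```

Under this model, 'a|b' and 'ab|c' satisfy the predicate, while 'a|' does not, so code that relies on the invariant never sees an `Empty` branch.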
This mirrors the same routine on ClassBytes. This is useful when
translating an HIR to an NFA and one wants to write a fast path for the
common all ASCII case.
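A minimal sketch of such a predicate and the fast path it enables, assuming a class is a list of inclusive codepoint ranges with start <= end. `ClassSketch`, `is_all_ascii` and `to_byte_ranges` are hypothetical names used for illustration, not the crate's types.

```rust
/// A toy Unicode class: a list of inclusive `(start, end)` codepoint ranges.
/// Each range is assumed to satisfy `start <= end`.
struct ClassSketch {
    ranges: Vec<(char, char)>,
}

impl ClassSketch {
    /// True if every codepoint in the class is ASCII. An empty class is
    /// vacuously all ASCII. With sorted, non-overlapping ranges it would be
    /// enough to look at the last range; this sketch checks them all.
    fn is_all_ascii(&self) -> bool {
        self.ranges.iter().all(|&(_, end)| end <= '\x7F')
    }

    /// Fast path: if the class is all ASCII, return it as byte ranges so a
    /// caller can skip UTF-8 sequence handling. Otherwise return `None`.
    fn to_byte_ranges(&self) -> Option<Vec<(u8, u8)>> {
        if !self.is_all_ascii() {
            return None;
        }
        // The casts are lossless because every bound is <= 0x7F.
        Some(self.ranges.iter().map(|&(s, e)| (s as u8, e as u8)).collect())
    }
}
```

A caller building an NFA could try `to_byte_ranges` first and fall back to the general Unicode path only when it returns `None`.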