This commit brings the utf8-ranges crate into regex-syntax as a utf8
sub-module.
This was done because it was observed that utf8-ranges is effectively
unused outside the context of regex-syntax. It is a very small amount of
code, and fits alongside the rest of regex-syntax. In particular, anyone
building a regex engine using regex-syntax will likely need this code
anyway.
Rust 1.28 is almost a year old by this point, and there were a number of
nice stabilizations between 1.24 and 1.28. Notably, vendor intrinsics were
stabilized in Rust 1.27, so we no longer need a build script.
The problem with putting it in the regex crate proper is that it
requires the regex crate to bump its minimal regex-syntax crate version.
While this isn't necessarily an issue, since we can't enable Cargo's
minimal version check because of the `rand` dependency, this winds up
being a hazard. Plus, having it in the regex crate doesn't buy us too
much. It's just as well to have the tests in regex-syntax.
Fixes #593
This fixes yet another bug with our handling of (?flags) directives in
the regex. This time, we try to be a bit more principled and
specifically treat a (?flags) directive as a valid empty sub-expression.
While this means we could remove the errors introduced by previous fixes
for things like `(?i)+`, we retain them for now since such patterns are a
bit weird, even though the equivalent `((?i))+` is now allowed. We should
probably allow `(?i)+` in the future for consistency's sake.
Fixes #527
This fixes a bug where the HIR translator would panic on regexes such as
`(?i){1}` since it assumes that every repetition operator has a valid
sub-expression, and `(?i)` is not actually a sub-expression (but is more
like a directive instead).
Previously, we fixed this same bug for *uncounted* repetitions in commit
17764ffe (for bug #465), but we did not fix it for counted repetitions.
We apply the same fix here.
Fixes #555
This adds a couple new methods on HIR expressions for determining whether
they are literals or not. This is useful for determining whether to apply
optimizations such as Aho-Corasick without re-analyzing the syntax.
This commit adds two new predicates to `Hir` values that permit querying
whether an expression is *line* anchored at the start or end.
This was motivated by a desire to tweak the offsets of a match when
enabling --crlf mode in ripgrep.
This commit adds several emoji properties such as Emoji and
Extended_Pictographic. We also add support for the Grapheme_Cluster_Break,
Word_Break and Sentence_Break enumeration properties.
Ensure `[[:blank:]]` only matches `[ \t]`. It appears that there was
a transcription error when `regex-syntax` was rewritten such that
`[[:blank:]]` ended up matching more than it was supposed to.
Fixes #533
This updates the documentation on `allow_invalid_utf8` to reflect the
current behavior of the translator. The old documentation was describing
the behavior of regex-syntax 0.5, but it was changed in regex-syntax
0.6.
This adds `scripts/generate.py`, and uses it to regenerate all tables
with data from Unicode 11.0.0. This also restores the character tests
that were first added in #400, with a new one for 11.
The issue with the ASCII version of \B is that it can match between code
units of UTF-8, which means it can cause match indices reported to be on
invalid UTF-8 boundaries. Therefore, similar to things like `(?-u:\xFF)`,
we ban negated ASCII word boundaries from Unicode regular expressions.
Normal ASCII word boundaries remain accessible from Unicode regular
expressions.
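
To illustrate the hazard, here is a minimal sketch that does not use the
regex crate at all. The helper names are made up for this example; they
only model the ASCII definition of a word byte (`[0-9A-Za-z_]`) and are not
the engine's actual implementation.

```rust
/// Returns true if `b` is an ASCII word byte, i.e. one of `[0-9A-Za-z_]`.
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Models an ASCII word boundary at byte offset `at`: exactly one of the
/// two neighboring bytes is an ASCII word byte. Positions outside the
/// haystack count as non-word.
fn is_ascii_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_ascii_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_ascii_word_byte(haystack[at]);
    before != after
}

/// Models the negated ASCII word boundary at byte offset `at`.
fn is_ascii_not_word_boundary(haystack: &[u8], at: usize) -> bool {
    !is_ascii_word_boundary(haystack, at)
}
```

For the haystack `"a\u{03B4}"` (bytes `61 CE B4`), the negated boundary
holds at offset 2, which lies inside the two-byte encoding of U+03B4, so
`str::is_char_boundary(2)` is false there. The non-negated boundary needs
an ASCII byte on one side, so it can only hold at a valid `char` boundary.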
See #457
This commit removes our explicit implementations of encode_utf8 and
replaces them with uses of `char::encode_utf8`, which was added to the
standard library in Rust 1.15.
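
For reference, a small sketch of how `char::encode_utf8` is typically
called. The wrapper name `utf8_bytes` is hypothetical and exists only for
this example.

```rust
/// Encodes `c` as UTF-8 using the standard library and returns the bytes.
/// A 4-byte buffer is always large enough for a single `char`.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    // `encode_utf8` writes into `buf` and returns the encoded prefix as
    // a `&mut str`.
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

For example, `utf8_bytes('\u{2603}')` yields the three bytes `E2 98 83`.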
This commit fixes a bug with the handling of `(?flags)` sub-expressions
in the parser. Previously, the parser read `(?flags)`, added it to the
current concatenation, and then treated that as a valid sub-expression for
repetition operators, as in `(?i)*`. This in turn caused the translator
to panic on a failed assumption: that witnessing a repetition operator
necessarily implies a preceding sub-expression. But `(?i)` has no
explicit representation in the HIR, so there is no sub-expression.
There are two legitimate ways to fix this:
1. Ban such constructions in the parser.
2. Remove the assumption in the translator, and/or always translate a
   `(?i)` into an empty sub-expression, which should generally be a
   no-op.
This commit chooses (1) because it is more conservative. That is, it
turns a panic into an error, which gives us flexibility in the future to
choose (2) if necessary.
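
As a rough illustration of option (1), here is a toy parser that rejects a
repetition operator whose operand would be a flags directive. It is a
sketch under heavy simplifying assumptions (only literals, `(?flags)` and
the operators `*`, `+`, `?`); the `Item` type and `parse` function are
invented for this example and are not the regex-syntax AST or parser.

```rust
#[derive(Debug, PartialEq)]
enum Item {
    Literal(char),
    /// A `(?flags)` directive; it is not a sub-expression.
    Flags(String),
    /// A repetition operator applied to a sub-expression.
    Repeat(Box<Item>, char),
}

/// Parses a tiny regex subset, returning an error instead of building a
/// repetition over a flags directive.
fn parse(pattern: &str) -> Result<Vec<Item>, String> {
    let chars: Vec<char> = pattern.chars().collect();
    let mut concat: Vec<Item> = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        match chars[i] {
            '(' if chars.get(i + 1) == Some(&'?') => {
                // Collect the flag letters up to the closing ')'.
                let start = i + 2;
                let mut end = start;
                while end < chars.len() && chars[end] != ')' {
                    end += 1;
                }
                if end == chars.len() {
                    return Err("unclosed flags directive".to_string());
                }
                concat.push(Item::Flags(chars[start..end].iter().collect()));
                i = end + 1;
            }
            '*' | '+' | '?' => {
                let op = chars[i];
                match concat.pop() {
                    None => {
                        return Err(format!("{:?} is missing an operand", op));
                    }
                    // The key check: a flags directive is not repeatable.
                    Some(Item::Flags(flags)) => {
                        return Err(format!(
                            "{:?} cannot be applied to (?{})",
                            op, flags
                        ));
                    }
                    Some(item) => concat.push(Item::Repeat(Box::new(item), op)),
                }
                i += 1;
            }
            c => {
                concat.push(Item::Literal(c));
                i += 1;
            }
        }
    }
    Ok(concat)
}
```

With this sketch, `parse("(?i)*")` returns an error, while `parse("(?i)a*")`
succeeds because the `*` applies to the literal `a`.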
Fixes #465
This re-generates the Unicode table for property name aliases after fixing
a bug in property name canonicalization. Namely, the 'isc' alias of the
'ISO_Comment' property was being canonicalized to 'c', which is actually
an alias of the 'Other' general category. This is a result of the
canonicalization procedure ignoring 'is' prefixes, as permitted by UTS#18.
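
The collision can be sketched as follows. Both functions are hypothetical
simplifications of loose name matching (lowercase, drop whitespace, `_` and
`-`, ignore a leading `is`), and the special case for `isc` is an assumed
shape for the fix, not a copy of the actual code.

```rust
/// Lowercases `name` and drops whitespace, underscores and hyphens.
fn squash(name: &str) -> String {
    name.chars()
        .filter(|c| !c.is_whitespace() && *c != '_' && *c != '-')
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// Naive canonicalization: always ignores a leading "is". This maps the
/// alias "isc" to "c", colliding with the general category alias.
fn normalize_naive(name: &str) -> String {
    let s = squash(name);
    s.strip_prefix("is").unwrap_or(s.as_str()).to_string()
}

/// Canonicalization with "isc" kept intact (assumed form of the fix).
fn normalize(name: &str) -> String {
    let s = squash(name);
    if s == "isc" {
        return s;
    }
    s.strip_prefix("is").unwrap_or(s.as_str()).to_string()
}
```

Here `normalize_naive("isc")` returns `"c"`, while `normalize("isc")`
returns `"isc"`; names such as `Is_Greek` still normalize to `greek` in
both versions.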
Fixes #466
This commit adds a new type of error message that is used whenever a
character class escape sequence is used as the start or end of a
character class range.
Fixes #461
This fixes an off-by-one bug in the error formatter. Namely, if a regex
ends with a literal `\n` *and* an error is reported that contains a span
at the end of the regex, then this trips a bug in the formatter because
its line count ends up being wrong. We fix this by tweaking the line
count. The actual error message is still a little wonky, but given the
literal `\n`, it's hard not to make it wonky.
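
The off-by-one can be reproduced with the standard library alone. The
sketch below assumes the formatter counts lines with `str::lines`, which
does not yield an empty final line after a trailing `\n`; the function
names are hypothetical.

```rust
/// Line count as reported by `str::lines`. A trailing '\n' does not
/// produce an extra (empty) line here.
fn naive_line_count(pattern: &str) -> usize {
    pattern.lines().count()
}

/// Line count adjusted so that a trailing '\n' starts a new line that a
/// span at the very end of the pattern can point at.
fn line_count(pattern: &str) -> usize {
    let mut n = pattern.lines().count();
    if pattern.ends_with('\n') {
        n += 1;
    }
    n
}

/// 1-based line number of the byte offset `offset` in `pattern`.
fn line_of_offset(pattern: &str, offset: usize) -> usize {
    1 + pattern[..offset].matches('\n').count()
}
```

For the pattern `"a\n"`, a span at offset 2 is on line 2, but
`naive_line_count` reports only 1 line; `line_count` reports 2.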
Fixes #464
This adds a printer for the high-level intermediate representation. The
regex it prints is valid, so it can be used to turn an HIR expression into
a regex::Regex.
Previously, we had some inconsistencies in how we were handling ASCII
word boundaries. In particular, the translator was accepting a negated
ASCII word boundary even if the caller didn't disable the UTF-8 invariant.
This is wrong, since a negated ASCII word boundary can match between any
two arbitrary bytes. However, fixing this is a breaking change, so for
now we document the bug. We plan to fix it with regex 1.0. See #457.
Additionally, we were incorrectly declaring that an ASCII word boundary
matched invalid UTF-8 via the Hir::is_always_utf8 property. An ASCII word
boundary must always match an ASCII byte on one side, which implies a
valid UTF-8 position.
This fixes a bug in the parser where a regex like `(?x)[ / - ]` would
fail to parse. In particular, since whitespace insensitive mode is
enabled, this regex should be equivalent to `[/-]`, where the `-` is
treated as a literal `-` instead of a range since it is the last
character in the class. However, the parser did not account for
whitespace insensitive mode, so it didn't see the `-` in `(?x)[ / - ]`
as trailing, and therefore reported an unclosed character class (since
the `]` was treated as part of the range).
We fix that in this commit by accounting for whitespace insensitive
mode, which we do by adding a `peek` method that skips over whitespace.
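
A minimal sketch of the idea, assuming the pattern is held as a slice of
`char`s. The signatures below are invented for this example and do not
match the parser's actual `peek` method.

```rust
/// Returns the next character after position `pos`. When
/// `ignore_whitespace` is true, whitespace is skipped first.
fn peek(pattern: &[char], pos: usize, ignore_whitespace: bool) -> Option<char> {
    pattern
        .iter()
        .skip(pos + 1)
        .copied()
        .find(|c| !(ignore_whitespace && c.is_whitespace()))
}

/// A `-` at `dash_pos` is a trailing literal if the next significant
/// character closes the class.
fn dash_is_literal(pattern: &[char], dash_pos: usize, ignore_whitespace: bool) -> bool {
    peek(pattern, dash_pos, ignore_whitespace) == Some(']')
}
```

In `[ / - ]` the `-` sits at index 4. Without whitespace skipping, the
peeked character is a space, so the `-` looks like the start of a range.
With skipping enabled, the peeked character is `]` and the `-` is treated
as a literal.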
Fixes #455
This commit fixes an embarrassing bug where the depth in the nest limit
checker was never decremented during postorder traversal, which means
long but shallow regexes would incorrectly trip the nest limit. We fix
that in this commit and add two regression tests.
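
The bug can be modeled with a toy AST. Everything below (`Ast`,
`NestLimiter`, the `decrement_on_exit` switch used to reproduce the old
behavior) is a hypothetical sketch, not the real checker.

```rust
enum Ast {
    Literal,
    Group(Box<Ast>),
    Concat(Vec<Ast>),
}

struct NestLimiter {
    limit: u32,
    depth: u32,
    /// Set to false to reproduce the bug: depth only ever grows.
    decrement_on_exit: bool,
}

impl NestLimiter {
    fn check(&mut self, ast: &Ast) -> Result<(), String> {
        match ast {
            // Leaves do not add nesting.
            Ast::Literal => Ok(()),
            Ast::Group(inner) => {
                self.enter()?;
                self.check(inner)?;
                self.exit();
                Ok(())
            }
            Ast::Concat(items) => {
                self.enter()?;
                for item in items {
                    self.check(item)?;
                }
                self.exit();
                Ok(())
            }
        }
    }

    /// Called before visiting the children of a nested expression.
    fn enter(&mut self) -> Result<(), String> {
        if self.depth >= self.limit {
            return Err(format!("nest limit {} exceeded", self.limit));
        }
        self.depth += 1;
        Ok(())
    }

    /// Called after all children have been visited (postorder).
    fn exit(&mut self) {
        if self.decrement_on_exit {
            self.depth -= 1;
        }
    }
}

fn check_nesting(ast: &Ast, limit: u32, decrement_on_exit: bool) -> Result<(), String> {
    NestLimiter { limit, depth: 0, decrement_on_exit }.check(ast)
}

/// A long but shallow expression: `n` sibling groups, real depth 2.
fn shallow(n: usize) -> Ast {
    Ast::Concat((0..n).map(|_| Ast::Group(Box::new(Ast::Literal))).collect())
}
```

With a limit of 5, `check_nesting(&shallow(100), 5, true)` succeeds, while
the same call with `decrement_on_exit` set to false fails because each
sibling group keeps increasing the recorded depth.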
Fixes #454