136 Commits

Author SHA1 Message Date
Andrew Gallant
169783c1d6
syntax: release 0.6.11 2019-08-03 16:10:47 -04:00
Andrew Gallant
b4c67cb80c syntax: drop ucd_util dependency
This one was a bit hard to swallow because it involved copying a
fairly short but not terribly simple function for normalizing property
names/values. But the code is so small, changes rarely, and is easily
tested, that it's just not worth bringing in a whole dependency for it
given how big regex-syntax already is.
2019-08-03 16:09:49 -04:00
Andrew Gallant
caa075f653 syntax: absorb utf8-ranges crate
This commit brings the utf8-ranges crate into regex-syntax as a utf8
sub-module.

This was done because it was observed that utf8-ranges is effectively
unused outside the context of regex-syntax. It is a very small amount of
code, and fits alongside the rest of regex-syntax. In particular, anyone
building a regex engine using regex-syntax will likely need this code
anyway.
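
For context, utf8-ranges converts ranges of Unicode scalar values into equivalent ranges of UTF-8 bytes. The sketch below shows only one preliminary step such a conversion needs: splitting a scalar value range so that every piece has a single encoded length and contains no surrogates. The function name `split_by_utf8_len` is hypothetical and this is not the module's actual code.

```rust
/// Splits an inclusive range of code points into sub-ranges whose members
/// all share the same UTF-8 encoded length. Surrogates (U+D800..=U+DFFF)
/// are dropped because they are not Unicode scalar values.
fn split_by_utf8_len(start: u32, end: u32) -> Vec<(u32, u32)> {
    // Inclusive blocks: 1 byte, 2 bytes, 3 bytes (below and above the
    // surrogate gap), and 4 bytes.
    const BLOCKS: [(u32, u32); 5] = [
        (0x0000, 0x007F),
        (0x0080, 0x07FF),
        (0x0800, 0xD7FF),
        (0xE000, 0xFFFF),
        (0x1_0000, 0x10_FFFF),
    ];
    BLOCKS
        .iter()
        .filter_map(|&(lo, hi)| {
            // Intersect the requested range with this block.
            let (s, e) = (start.max(lo), end.min(hi));
            if s <= e {
                Some((s, e))
            } else {
                None
            }
        })
        .collect()
}
```

A full conversion would then split each piece further until it can be written as a sequence of byte ranges, which is the part a regex engine consumes.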
2019-08-03 16:09:49 -04:00
Andrew Gallant
fc3e6aa19a
license: remove license headers from files
The Rust project determined these were unnecessary a while back[1,2,3]
and we follow suit.

[1] - 0565653eec
[2] - https://github.com/rust-lang/rust/pull/43498
[3] - https://github.com/rust-lang/rust/pull/57108
2019-08-03 14:47:45 -04:00
Andrew Gallant
0e96af4166
style: start using rustfmt 2019-08-03 14:20:22 -04:00
Andrew Gallant
341f207c10
regex-syntax-0.6.10 2019-07-20 23:01:44 -04:00
Andrew Gallant
dc111a5f19
syntax: update Unicode ages lookup
This was a missed fix for the Unicode 12.1 update.
2019-07-20 23:01:23 -04:00
Andrew Gallant
0c57ea14ea
syntax: release 0.6.9 2019-07-20 22:46:46 -04:00
Andrew Gallant
3124a3b2ca
syntax: update to Unicode 12.1 2019-07-20 22:45:39 -04:00
Andrew Gallant
918350a59b
msrv: bump to Rust 1.28
Rust 1.28 is almost a year old by this point, and there were a number of
nice stabilizations between 1.24 and 1.28. Notably, vendor intrinsics were
stabilized in Rust 1.27, so we no longer need a build script.
2019-07-20 22:35:18 -04:00
Gurwinder Singh
dfe0dc6493 syntax/doc: fix typo 2019-07-14 08:04:21 -04:00
Andrew Gallant
62b7b508fa
regex-syntax-0.6.8 2019-07-06 09:16:20 -04:00
Andrew Gallant
886a7e7185
syntax: move error test to syntax crate
The problem with putting it in the regex crate proper is that it
requires the regex crate to bump its minimal regex-syntax crate version.
While this isn't necessarily an issue, since we can't enable Cargo's
minimal version check because of the `rand` dependency, this winds up
being a hazard. Plus, having it in the regex crate doesn't buy us too
much. It's just as well to have the tests in regex-syntax.

Fixes #593
2019-07-06 09:15:11 -04:00
Christian Rondeau
172898a4fd syntax: better errors for missing repetition quantifier
This change causes a better error message to surface when
a repetition quantifier is used with a missing number.

Closes #545
2019-06-11 07:45:27 -04:00
Andrew Gallant
3ffe9a20b8
regex-syntax-0.6.7 2019-06-09 08:57:15 -04:00
Andrew Gallant
53270d8232
syntax: fix warnings
The language team is getting deprecation-happy with old syntax. But Rust
1.24.1 doesn't support inclusive range syntax, so we forcefully allow
it.
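
For readers unfamiliar with the syntax in question, this is a small illustration of the newer inclusive range form in a pattern. The function name `is_digit_byte` is made up for this example; the older spelling of the same pattern used `...` in place of `..=`.

```rust
/// Returns true if `b` is an ASCII digit, using the `..=` inclusive range
/// pattern. The older `...` spelling of this pattern is what newer compilers
/// warn about.
fn is_digit_byte(b: u8) -> bool {
    match b {
        b'0'..=b'9' => true,
        _ => false,
    }
}
```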
2019-06-09 08:49:06 -04:00
Andrew Gallant
89074f87d0
1.1.3 2019-03-30 10:53:01 -04:00
Andrew Gallant
231643248b syntax: fix bug when parsing ((?x))
This fixes yet another bug with our handling of (?flags) directives in
the regex. This time, we try to be a bit more principled and
specifically treat a (?flags) directive as a valid empty sub-expression.
While this means we could remove errors reported from previous fixes for
things like `(?i)+`, we retain those for now since they are a bit weird,
even though the equivalent `((?i))+` is now allowed. We should
probably allow `(?i)+` in the future for consistency's sake.

Fixes #527
2019-03-30 10:47:45 -04:00
Andrew Gallant
7b1599f2f6 syntax: fix counted repetition bug
This fixes a bug where the HIR translator would panic on regexes such as
`(?i){1}` since it assumes that every repetition operator has a valid
sub-expression, and `(?i)` is not actually a sub-expression (but is more
like a directive instead).

Previously, we fixed this same bug for *uncounted* repetitions in commit
17764ffe (for bug #465), but we did not fix it for counted repetitions.
We apply the same fix here.

Fixes #555
2019-03-30 10:47:45 -04:00
Andrew Gallant
bd5f2b4be5 syntax: add is_literal and is_alternation_literal
This adds a couple new methods on HIR expressions for determining whether
they are literals or not. This is useful for determining whether to apply
optimizations such as Aho-Corasick without re-analyzing the syntax.
2019-03-30 08:18:19 -04:00
Andrew Gallant
60d087a230
regex-syntax-0.6.5 2019-01-26 11:14:37 -05:00
Andrew Gallant
0fc24d275a
syntax: add is_line_anchored_{start,end}
This commit adds two new predicates to `Hir` values that permit querying
whether an expression is *line* anchored at the start or end.

This was motivated by a desire to tweak the offsets of a match when
enabling --crlf mode in ripgrep.
2019-01-26 11:14:27 -05:00
Andrew Gallant
b77e3fca8a
regex-syntax-0.6.4 2018-11-30 22:05:18 -05:00
Daniel Holbert
e214d8cd88 doc: Fix typo in comment ("ocassionally")
PR #515
2018-11-30 20:02:29 -05:00
Andrew Gallant
ecc1a5a70d syntax: add emoji and break properties
This commit adds several emoji properties such as Emoji and
Extended_Pictographic. We also add support for the Grapheme_Cluster_Break,
Word_Break and Sentence_Break enumeration properties.
2018-11-30 20:00:49 -05:00
Andrew Gallant
770edd59b2
regex-syntax-0.6.3 2018-11-07 17:20:08 -05:00
Derek Gonyeo
ce4154365f syntax/license: add the unicode license for unicode-tables
Add the Unicode license to the unicode-tables directory, as the data
there comes from the Unicode Consortium.

Fixes #530
2018-11-07 17:19:51 -05:00
kennytm
5241919f48 syntax: fix [[:blank:]] character class
Ensure `[[:blank:]]` only matches `[ \t]`. It appears that there was
a transcription error when `regex-syntax` was rewritten such that
`[[:blank:]]` ended up matching more than it was supposed to.

Fixes #533
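
The intended membership of the class is small enough to state as a predicate. This is only an illustration of the corrected behavior, with a hypothetical function name, not the table regex-syntax uses:

```rust
/// Returns true if `c` belongs to the POSIX `[[:blank:]]` class, which is
/// exactly space and horizontal tab.
fn is_posix_blank(c: char) -> bool {
    c == ' ' || c == '\t'
}
```

Other whitespace, such as `\n` or `\r`, is not part of the class.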
2018-10-29 08:24:15 -04:00
Andrew Gallant
8421c9ae85
regex-syntax 0.6.2 2018-07-18 09:24:25 -04:00
Andrew Gallant
24c7770b80
syntax: fix printing bug for HIR
This commit fixes a bug in the HIR printer where it would not correctly
escape meta characters in character classes.
2018-07-18 09:15:27 -04:00
Andrew Gallant
7ebe4ae02d
syntax: update docs to reflect behavior
This updates the documentation on `allow_invalid_utf8` to reflect the
current behavior of the translator. The old documentation was describing
the behavior of regex-syntax 0.5, but it was changed in regex-syntax
0.6.
2018-07-18 09:14:26 -04:00
Andrew Gallant
bf8f55f187
regex-syntax-0.6.1 2018-06-12 06:55:06 -04:00
Josh Stone
5eaff67a6a syntax: regenerate tables for Unicode 11
This adds `scripts/generate.py`, and uses it to regenerate all tables
with data from Unicode 11.0.0.  This also restores the character tests
that were first added in #400, with a new one for 11.
2018-06-12 06:54:13 -04:00
Andrew Gallant
b5ef0ec281
regex 1.0 2018-05-01 16:52:05 -04:00
Andrew Gallant
8e180eb71f syntax: fixes for Rust 1.20.0
Make sure we can run tests for regex-syntax on Rust 1.20.0.
2018-05-01 16:48:46 -04:00
Andrew Gallant
76343f8cd6 regex: ban (?-u:\B) for Unicode regexes
The issue with the ASCII version of \B is that it can match between code
units of UTF-8, which means it can cause match indices reported to be on
invalid UTF-8 boundaries. Therefore, similar to things like `(?-u:\xFF)`,
we ban negated ASCII word boundaries from Unicode regular expressions.
Normal ASCII word boundaries remain accessible from Unicode regular
expressions.

See #457
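
The problem can be reproduced with the standard library alone. The sketch below hand-rolls an ASCII-only `\B` check over bytes (all function names are hypothetical) and reports the offsets where it matches inside a multi-byte UTF-8 sequence:

```rust
/// Returns true if `b` is an ASCII word byte: [0-9A-Za-z_].
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Returns true if an ASCII-only `\B` would match at byte offset `at`,
/// i.e. the bytes on both sides agree on being word or non-word bytes.
fn ascii_not_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_ascii_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_ascii_word_byte(haystack[at]);
    before == after
}

/// Collects the offsets where ASCII `\B` matches but which do not fall on a
/// UTF-8 character boundary.
fn invalid_match_offsets(text: &str) -> Vec<usize> {
    (0..=text.len())
        .filter(|&i| {
            ascii_not_word_boundary(text.as_bytes(), i) && !text.is_char_boundary(i)
        })
        .collect()
}
```

For the two-byte character `δ`, offset 1 sits between two non-word bytes, so `\B` matches there even though it is not a valid position in a `&str`.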
2018-05-01 16:48:46 -04:00
Andrew Gallant
9604cc07ed unicode: remove implementations of encode_utf8
This commit removes our explicit implementations of encode_utf8 and
replaces them with uses of `char::encode_utf8`, which was added to the
standard library in Rust 1.15.
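
A minimal sketch of the standard library call in use; the wrapper name `utf8_bytes` is made up for this example:

```rust
/// Encodes `c` as UTF-8 and returns the bytes. Four bytes is the longest
/// possible encoding, so a 4-byte stack buffer always suffices.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```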
2018-05-01 16:48:46 -04:00
Andrew Gallant
05ab8f318d *: switch from try! to ? 2018-05-01 16:48:46 -04:00
Andrew Gallant
92e7baf584
regex-syntax 0.5.6 2018-05-01 13:28:53 -04:00
Andrew Gallant
17764ffe17
syntax: fix handling of (?flags) in parser
This commit fixes a bug with the handling of `(?flags)` sub-expressions
in the parser. Previously, the parser read `(?flags)`, added it to the
current concatenation, and then treated that as a valid sub-expression for
repetition operators, as in `(?i)*`. This in turn caused the translator
to panic on a failed assumption: that witnessing a repetition operator
necessarily implies a preceding sub-expression. But `(?i)` has no
explicit representation in the HIR, so there is no sub-expression.

There are two legitimate ways to fix this:

1. Ban such constructions in the parser.
2. Remove the assumption in the translator, and/or always translate a
   `(?i)` into an empty sub-expression, which should generally be a
   no-op.

This commit chooses (1) because it is more conservative. That is, it
turns a panic into an error, which gives us flexibility in the future to
choose (2) if necessary.

Fixes #465
2018-04-28 12:02:39 -04:00
Andrew Gallant
d5e5da68e2
syntax: fix 'C' alias bug
This re-generates the Unicode table for property name aliases after fixing
a bug in property name canonicalization. Namely, the 'isc' alias of the
'ISO_Comment' property was being canonicalized to 'c', which is actually
an alias of the 'Other' general category. This is a result of the
canonicalization procedure ignoring 'is' prefixes, as permitted by UTS#18.

Fixes #466
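
The interaction can be sketched as follows. This is a simplified take on loose matching for symbolic names (ignore case, whitespace, underscores, hyphens, and a leading `is`), with the `isc` special case added. The function name and the order of the steps are assumptions, not the code used to generate the tables.

```rust
/// Loosely normalizes a property name or value: drops whitespace, `_` and
/// `-`, lowercases the rest, and strips a leading "is".
fn normalize_symbolic_name(name: &str) -> String {
    let mut out: String = name
        .chars()
        .filter(|c| !c.is_whitespace() && *c != '_' && *c != '-')
        .flat_map(|c| c.to_lowercase())
        .collect();
    let had_is_prefix = out.starts_with("is");
    if had_is_prefix {
        // "is" is ASCII, so byte offset 2 is always a char boundary.
        out.replace_range(..2, "");
    }
    // Special case: "isc" is itself an alias (of ISO_Comment). Stripping the
    // prefix would leave "c", which names a different property value, so the
    // prefix is restored.
    if had_is_prefix && out == "c" {
        out.insert_str(0, "is");
    }
    out
}
```

Without the final check, `isc` and `c` would collapse to the same key, which is the collision described above.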
2018-04-28 10:44:41 -04:00
Andrew Gallant
f7ea409880
syntax: better error messages for '[\d-a]'
This commit adds a new type of error message that is used whenever a
character class escape sequence is used as the start or end of a
character class range.

Fixes #461
2018-04-28 09:50:25 -04:00
Andrew Gallant
15a68c8856
regex-syntax 0.5.5 2018-04-14 16:44:01 -04:00
Andrew Gallant
9ba9a758c2
syntax: fix bug in error printer
This fixes an off-by-one bug in the error formatter. Namely, if a regex
ends with a literal `\n` *and* an error is reported that contains a span
at the end of the regex, then this trips a bug in the formatter because
its line count ends up being wrong. We fix this by tweaking the line
count. The actual error message is still a little wonky, but given the
literal `\n`, it's hard not to make it wonky.

Fixes #464
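
The underlying behavior is easy to trip over: `str::lines` reports the same number of lines whether or not the text ends with `\n`, yet a span at the very end of such a pattern sits on the following (empty) line. One way the tweak could look, with a hypothetical function name and not necessarily matching the actual change:

```rust
/// Counts the lines a span can fall on, treating a trailing `\n` as
/// opening one more (empty) line.
fn line_count(pattern: &str) -> usize {
    let mut count = pattern.lines().count();
    if pattern.ends_with('\n') {
        count += 1;
    }
    count
}
```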
2018-04-14 16:35:02 -04:00
Andrew Gallant
dba7f3b041
regex-syntax-0.5.3 2018-03-13 21:44:49 -04:00
Andrew Gallant
97651fb604 syntax/hir: add a printer for HIR
This adds a printer for the high-level intermediate representation. The
regex it prints is valid, so printing an HIR expression provides a way to
turn it into a regex::Regex.
2018-03-13 21:44:08 -04:00
Andrew Gallant
c230e59468 syntax/hir: fix handling of ASCII word boundaries
Previously, we had some inconsistencies in how we were handling ASCII
word boundaries. In particular, the translator was accepting a negated
ASCII word boundary even if the caller didn't disable the UTF-8 invariant.
This is wrong, since a negated ASCII word boundary can match between any
two arbitrary bytes. However, fixing this is a breaking change, so for
now we document the bug. We plan to fix it with regex 1.0. See #457.

Additionally, we were incorrectly declaring that an ASCII word boundary
matched invalid UTF-8 via the Hir::is_always_utf8 property. An ASCII word
boundary must always match an ASCII byte on one side, which implies a
valid UTF-8 position.
2018-03-13 21:44:08 -04:00
Andrew Gallant
c7c7a43827 style: reword ast::print docs
Also, small formatting fix and removal of debugging test.
2018-03-13 21:44:08 -04:00
Andrew Gallant
a3c0510711
regex-syntax-0.5.2 2018-03-12 09:49:20 -04:00
Andrew Gallant
102458feff
syntax: fix trailing - bug
This fixes a bug in the parser where a regex like `(?x)[ / - ]` would
fail to parse. In particular, since whitespace insensitive mode is
enabled, this regex should be equivalent to `[/-]`, where the `-` is
treated as a literal `-` instead of a range since it is the last
character in the class. However, the parser did not account for
whitespace insensitive mode, so it didn't see the `-` in `(?x)[ / - ]`
as trailing, and therefore reported an unclosed character class (since
the `]` was treated as part of the range).

We fix that in this commit by accounting for whitespace insensitive
mode, which we do by adding a `peek` method that skips over whitespace.

Fixes #455
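
A rough sketch of the idea, assuming a parser that tracks a byte offset into the pattern. The name `peek_skipping_space` and its signature are hypothetical, and a real implementation would likely also need to skip `#` comments in whitespace insensitive mode:

```rust
/// Returns the character following the one at byte `offset`. When
/// `ignore_whitespace` is true, any whitespace in between is skipped first.
/// `offset` must be a char boundary within `pattern`.
fn peek_skipping_space(
    pattern: &str,
    offset: usize,
    ignore_whitespace: bool,
) -> Option<char> {
    let mut rest = pattern[offset..].chars();
    // Step past the current character; bail out if there is none.
    rest.next()?;
    if ignore_whitespace {
        rest.find(|c| !c.is_whitespace())
    } else {
        rest.next()
    }
}
```

With the pattern `(?x)[ / - ]` and the offset of `-` (byte 8), plain peeking sees a space, while the whitespace-skipping variant sees `]` and can therefore treat the `-` as trailing.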
2018-03-12 09:27:02 -04:00