The docs are now updated to work with Unicode 14. (In particular,
emoji-data.txt no longer needs to be downloaded separately.) We also
include a note about adding a new case for "age" in regex-syntax.
This commit refactors the way this library handles Unicode data by
making it completely optional. Several features are introduced which
permit callers to select only the Unicode data they need (up to a point
of granularity).
An important property of these changes is that the presence or absence of
crate features will never change the match semantics of a regular
expression. Instead, the presence or absence of a crate feature can only
add or subtract from the set of all possible valid regular expressions.
So for example, if the `unicode-case` feature is disabled, then
attempting to produce `Hir` for the regex `(?i)a` will fail. Instead,
callers must use `(?i-u)a` (or enable the `unicode-case` feature).
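For example, here is a minimal sketch of what that looks like through the
parser (it assumes a build of `regex-syntax` with the `unicode-case`
feature disabled, e.g. via `default-features = false`):

    use regex_syntax::Parser;

    fn main() {
        // Assumes the `unicode-case` feature is disabled at build time. In
        // that configuration, Unicode-aware case insensitivity is
        // unavailable, so translating this pattern to an `Hir` fails...
        assert!(Parser::new().parse(r"(?i)a").is_err());
        // ...while ASCII-only case insensitivity still works, since `-u`
        // removes the need for Unicode case folding data.
        assert!(Parser::new().parse(r"(?i-u)a").is_ok());
    }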
This partially addresses #583 since it permits callers to decrease
binary size.
This replaces the previous Python script, which was starting to rot
slightly. In general, I prefer shell scripts for this sort of thing,
even at the cost of some portability across other platforms.
This patch adds some infrastructure to scrape crates.io
for regexes, then run each regex found in this way through
a random testing gauntlet to make sure that all the different
backends behave in the same way. These random tests are
expensive, so we only run them when the magic
`RUST_REGEX_RANDOM_TEST` environment variable is set.
In debug mode, these tests take quite a while, so we
special-case them in CI to run in release mode.
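A minimal sketch of how such an environment-variable gate might look in a
test (the test name and body here are illustrative, not the actual harness):

    #[test]
    fn crates_regex_random_gauntlet() {
        // Skip the expensive randomized tests unless explicitly requested.
        if std::env::var_os("RUST_REGEX_RANDOM_TEST").is_none() {
            return;
        }
        // ... feed each scraped regex through every backend and compare
        // the results ...
    }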
To make this better, we should add something that can generate
a matching string from a regex. As it stands, we just focus on
the negative case.
There is one bug that this uncovered which this patch does not
fix. A minimal version of it is commented out in the
`tests/test_crates_regex.rs` file.
PR #472
This adds `scripts/generate.py`, and uses it to regenerate all tables
with data from Unicode 11.0.0. This also restores the character tests
that were first added in #400, with a new one for Unicode 11.
This commit does the mechanical changes necessary to remove the old
regex-syntax crate and replace it with the rewrite. The rewrite now
subsumes the `regex-syntax` crate name, and gets a semver bump to 0.5.0.
The principal change in this commit is a complete rewrite of how
literals are detected from a regular expression. In particular, we now
traverse the abstract syntax to discover literals instead of the
compiled byte code. This permits more tuneable control over which and
how many literals are extracted, and is now exposed in the
`regex-syntax` crate so that others can benefit from it.
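As a rough illustration, here is a sketch of literal extraction driven
through `regex-syntax`. The exact interface has changed over time; the names
below (`hir::literal::Extractor`) come from recent releases of the crate, so
treat them as an approximation of what this commit exposes:

    use regex_syntax::hir::literal::Extractor;
    use regex_syntax::Parser;

    fn main() {
        // Parse a pattern into its high-level intermediate representation,
        // then ask the extractor for prefix literals.
        let hir = Parser::new().parse(r"(foo|bar)\w+").unwrap();
        let prefixes = Extractor::new().extract(&hir);
        // This should report the (inexact) prefixes "foo" and "bar", which
        // a matcher can use to quickly skip ahead in a haystack.
        println!("{:?}", prefixes);
    }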
Other changes in this commit:
* The Boyer-Moore algorithm was rewritten to use my own concoction based
on frequency analysis. We end up regressing on a couple benchmarks
slightly because of this, but gain in some others and in general should
be faster in a broader number of cases. (Principally because we try to
run `memchr` on the rarest byte in a literal.) This should also greatly
improve handling of non-Western text. (A sketch of the rarest-byte idea
appears after this list.)
* A "reverse suffix" literal optimization was added. That is, if suffix
literals exist but no prefix literals exist, then we can quickly scan
for suffix matches and then run the DFA in reverse to find matches.
(I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool
(from the new `mempool` crate). This reduces some amount of constant
overhead and improves several benchmarks that either search short
haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it
could only contain 2 or more). The InvalidSet error variant is now
deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the
start state was always the first instruction, which is trivially
wrong for an expression like `^☃$`. This bug persisted because it
typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published
sub-crate. The CLI tool can report various facts about a regular
expression, such as its AST, its compiled byte code or its
detected literals.
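To expand on the frequency-analysis point above, here is a minimal sketch of
the rarest-byte idea (this is not the crate's actual implementation;
`freq_rank` is a hypothetical table mapping each byte to an assumed
background frequency, where a lower value means rarer):

    /// Find the first occurrence of `needle` in `haystack` by scanning for
    /// its rarest byte with `memchr` and verifying the whole literal around
    /// each candidate hit.
    fn find(haystack: &[u8], needle: &[u8], freq_rank: &[u8; 256]) -> Option<usize> {
        // Offset and value of the byte in `needle` assumed to be rarest.
        let (rare_pos, &rare_byte) = needle
            .iter()
            .enumerate()
            .min_by_key(|&(_, &b)| freq_rank[b as usize])?;
        let mut start = 0;
        while start + needle.len() <= haystack.len() {
            // Look for the rare byte at or after where it would sit if a
            // match began at `start`.
            let i = memchr::memchr(rare_byte, &haystack[start + rare_pos..])?;
            // The rare byte sits at `start + rare_pos + i`, so a candidate
            // match begins at `start + i`.
            let candidate = start + i;
            if candidate + needle.len() <= haystack.len()
                && &haystack[candidate..candidate + needle.len()] == needle
            {
                return Some(candidate);
            }
            start = candidate + 1;
        }
        None
    }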
Closes #96, #188, #189
In commit 56ea4a, char classes were changed so that case folding them
stored all possible variants in the class ranges. This makes it possible
to drastically simplify the compiler to the point where case folding flags
can be completely removed. This has two major implications for
performance:
1. Matching engines no longer need to do case folding on the input.
2. Since case folding is now part of the automata, literal prefix
optimizations are now automatically applied even to regexes with
(?i).
This makes several changes in the public API of regex-syntax. Namely,
the `casei` flag has been removed from the `CharClass` expression and
the corresponding `is_case_insensitive` method has been removed.
TL;DR - The combination of case folding, character classes and nested
negation is darn tricky.
The problem presented in #99 was related to how we're storing case folded
character classes. Namely, we only store the canonical representation
of each character (which means that when we match text, we must apply
case folding to the input). But when this representation is negated,
information is lost.
From #99, consider the negated class with a single range `x`. The class is
negated before applying case folding. The negated class includes `X`,
so that case folding includes both `X` and `x` even though the regex
in #99 is specifically trying to not match either `X` or `x`.
The solution is to apply case folding *before* negation. But given our
representation, this doesn't work. Namely, case folding the range `x`
yields `x` with a case insensitive flag set. Negating this class ends up
matching all characters sans `x`, which means it will match `X`.
So I've backtracked the representation to include *all* case folding
variants. This means we can negate case folded classes and get the
expected result. e.g., case folding the class `[x]` yields `[xX]`, and
negating `[xX]` gives the desired result for the regex in #99.
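A quick illustration of the intended behavior, written against the `regex`
crate (a sketch; the pattern mirrors the one discussed in #99):

    use regex::Regex;

    fn main() {
        // Case folding is applied to the class contents before negation, so
        // a case-insensitive negated class excludes both `x` and `X`.
        let re = Regex::new(r"(?i)[^x]").unwrap();
        assert!(!re.is_match("x"));
        assert!(!re.is_match("X"));
        assert!(re.is_match("y"));
    }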
This commit introduces a new `regex-syntax` crate that provides a
regular expression parser and an abstract syntax for regular
expressions. As part of this effort, the parser has been rewritten and
has grown a substantial number of tests.
The `regex` crate itself hasn't changed too much. I opted for the
smallest possible delta to get it working with the new regex AST.
In most cases, this simplified code because it no longer has to deal
with unwieldy flags. (Instead, flag information is baked into the AST.)
Here is a list of public facing non-breaking changes:
* A new `regex-syntax` crate with a parser, regex AST and lots of tests.
This closes #29 and fixes #84.
* A new flag, `x`, has been added. This allows one to write regexes with
insignificant whitespace and comments. (See the example after this list.)
* Repetition operators can now be directly applied to zero-width
matches. e.g., `\b+` was previously not allowed but now works.
Note that one could always write `(\b)+` previously. This change
is mostly about lifting an arbitrary restriction.
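For example, a small sketch of the new `x` flag in use (the pattern itself
is just illustrative):

    use regex::Regex;

    fn main() {
        // With the `x` flag, whitespace is insignificant and `#` starts a
        // comment that runs to the end of the line.
        let re = Regex::new(
            r"(?x)
            \d{4} - # year
            \d{2} - # month
            \d{2}   # day
            ",
        )
        .unwrap();
        assert!(re.is_match("2015-02-14"));
    }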
And a list of breaking changes:
* A new `Regex::with_size_limit` constructor function that allows one
to tweak the limit on the size of a compiled regex. This fixes #67.
The new method isn't a breaking change, but regexes that exceed the
size limit (set to 10MB by default) will no longer compile. To fix,
simply call `Regex::with_size_limit` with a bigger limit.
* Capture group names cannot start with a number. This is a breaking
change because regexes that previously compiled (e.g., `(?P<1a>.)`)
will now return an error. This fixes #69.
* The `regex::Error` type has been changed to reflect the better error
reporting in the `regex-syntax` crate, and a new error for limiting
regexes to a certain size. This is a breaking change. Most folks just
call `unwrap()` on `Regex::new`, so I expect this to have minimal
impact.
Closes #29, #67, #69, #79, #84.
[breaking-change]
There was an easy opportunity to better optimize the tables generated
by unicode.py. Not sure why I didn't catch this long ago, but in any
case now the tables are substantially smaller and should maybe improve
performance slightly.
There was also some dead code sitting in unicode.py that I pulled out.
This commit pulls in the script to generate tables.rs in the main distribution
and strips it down to just the bare bones necessary for regexes (which is still
quite a lot!). The script was used to generate a `unicode.rs` file which
contains all the data needed from the libunicode crate.
Eventually we hope to provide libunicode in some form on crates.io or perhaps
stabilize it in the distribution itself, but for now it's not so bad to vendor
the dependency (which doesn't change much) and it's required to get libregex
building on stable Rust.
Fixes #31 and #33.
There are a number of related changes in this commit:
1. A script that generates the 'match' tests has been reintroduced.
2. The regex-dna shootout benchmark has been updated.
3. Running `cargo test` on the `regex` crate does not require
`regex_macros`.
4. The documentation has been updated to use `Regex::new(...).unwrap()`
   instead of `regex!`. The emphasis on using `regex!` has been reduced,
   and a note about its unavailability in Rust 1.0 beta/stable has been
   added. (A small example of the new style follows this list.)
5. Updated Travis to test both `regex` and `regex_macros`.
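For reference, a minimal sketch of the style now used in the docs (the
pattern is illustrative):

    use regex::Regex;

    fn main() {
        // Compile at runtime and unwrap, instead of using the `regex!` macro.
        let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
        assert!(re.is_match("2015-04-01"));
    }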