13 Commits

Author SHA1 Message Date
Andrew Gallant
27c0d6d944 style: rust updated rustfmt 2020-01-09 14:26:57 -05:00
Andrew Gallant
e98090db75 regex: support perf-literal
This commit enables support for the perf-literal feature. When it's
disabled, no literal optimizations will be performed. Instead, only
the regex engine itself is used.

In practice, it's quite plausible that we don't need to disable *all*
literal optimizations. But that is the simplest path here, and I don't
have the stomach to do anything more with the current code. src/exec.rs
has turned into a giant soup.
2019-09-03 12:35:17 -04:00
Andrew Gallant
0e96af4166 style: start using rustfmt 2019-08-03 14:20:22 -04:00
Andrew Gallant
76343f8cd6 regex: ban (?-u:\B) for Unicode regexes
The issue with the ASCII version of \B is that it can match between code
units of UTF-8, which means it can cause match indices reported to be on
invalid UTF-8 boundaries. Therefore, similar to things like `(?-u:\xFF)`,
we ban negated ASCII word boundaries from Unicode regular expressions.
Normal ASCII word boundaries remain accessible from Unicode regular
expressions.

See #457
2018-05-01 16:48:46 -04:00
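To make the problem described in 76343f8cd6 concrete, here is a minimal, stdlib-only sketch. It does not use the regex crate. The helpers `is_ascii_word_byte`, `is_not_ascii_word_boundary`, and `invalid_match_offsets` are hypothetical names introduced for illustration, and they only approximate how an ASCII `\B` assertion could be evaluated at a byte offset.

```rust
/// True if `b` is an ASCII word byte, i.e. one of `[0-9A-Za-z_]`.
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Hypothetical helper: true if an ASCII `\B` (negated word boundary)
/// would hold at byte offset `at`, looking only at the adjacent bytes.
fn is_not_ascii_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_ascii_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_ascii_word_byte(haystack[at]);
    before == after
}

/// Byte offsets where ASCII `\B` holds but which fall inside a
/// multi-byte UTF-8 sequence, so they are not valid `&str` indices.
fn invalid_match_offsets(haystack: &str) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&at| is_not_ascii_word_boundary(haystack.as_bytes(), at))
        .filter(|&at| !haystack.is_char_boundary(at))
        .collect()
}

fn main() {
    // "δ" is encoded as the two bytes 0xCE 0xB4. Neither is an ASCII
    // word byte, so ASCII `\B` holds at offset 1, between the two bytes.
    println!("{:?}", invalid_match_offsets("δ")); // prints [1]
    // In pure ASCII text every offset is a char boundary.
    println!("{:?}", invalid_match_offsets("abc")); // prints []
}
```

An ordinary ASCII `\b` needs a word byte on exactly one side, and word bytes are always single-byte characters, so it cannot land inside a multi-byte sequence. That is consistent with the commit leaving it accessible.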
Lukas Lueg
264ef3f421 Revert some unwarranted clippy-changes 2017-06-01 19:38:23 +02:00
Lukas Lueg
94f8213def Fix clippy warnings 2017-05-31 22:24:22 +02:00
Andrew Gallant
f094d15678 Update github links. 2016-12-31 16:49:30 -05:00
Andrew Gallant
dd120a963a Require escaping of [, &, - and ~ in classes.
The escaping of &, - and ~ is only required when the characters are
repeated adjacently, which should be quite rare. Escaping of [ is always
required, unless it appears in the second position of a range.

These rules enable us to add character class sets as described in
UTS#18 RL1.3 in a backward compatible way.
2016-12-30 01:06:18 -05:00
Andrew Gallant
1f7f5c9a51 Fix tests. 2016-12-30 01:05:51 -05:00
Andrew Gallant
d44a9f94ab Switch bytes::Regex to using Unicode mode by default. 2016-12-30 01:05:43 -05:00
Scott Steele
b96e5cb899 Verify character class still non-empty after converting to byte class
For `[^\x00-\xff]`, while it is still treated as a full Unicode
character class, it is not empty. For instance `≥` would still be
matched.

However, when `CharClass::to_byte_class` is called on it (as is done
when using `regex::bytes::Regex::new` rather than `regex::Regex::new`),
it _is_ now empty, since it excludes all possible bytes.

This commit adds a test asserting that `regex::bytes::Regex::new`
returns `Err` for this case (in accordance with
https://github.com/rust-lang-nursery/regex/issues/106) and adds an
`is_empty` check to the result of calling `CharClass::to_byte_class`,
which allows the test to pass.
2016-12-07 21:20:08 -05:00
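The emptiness argument in b96e5cb899 can be sketched with plain range arithmetic. This is a simplified model, not the actual `CharClass` implementation: the `Range` alias and the functions `negate`, `contains`, and `to_byte_ranges` are hypothetical, surrogate code points are ignored, and the byte conversion is assumed to keep only the part of each range that fits in `0x00..=0xFF`.

```rust
/// Inclusive range of code points (surrogates ignored for simplicity).
type Range = (u32, u32);

const MAX_SCALAR: u32 = 0x10FFFF;

/// Negate a sorted, non-overlapping list of ranges over 0..=MAX_SCALAR.
fn negate(ranges: &[Range]) -> Vec<Range> {
    let mut out = Vec::new();
    let mut next = 0u32;
    for &(lo, hi) in ranges {
        if lo > next {
            out.push((next, lo - 1));
        }
        next = hi + 1;
    }
    if next <= MAX_SCALAR {
        out.push((next, MAX_SCALAR));
    }
    out
}

/// True if `c` falls inside any of the ranges.
fn contains(ranges: &[Range], c: char) -> bool {
    let cp = c as u32;
    ranges.iter().any(|&(lo, hi)| lo <= cp && cp <= hi)
}

/// Assumed model of a byte-class conversion: keep only the portion of
/// each range that is representable as a single byte.
fn to_byte_ranges(ranges: &[Range]) -> Vec<(u8, u8)> {
    ranges
        .iter()
        .filter(|r| r.0 <= 0xFF)
        .map(|&(lo, hi)| (lo as u8, hi.min(0xFF) as u8))
        .collect()
}

fn main() {
    // Model of `[^\x00-\xff]` as a Unicode class.
    let class = negate(&[(0x00, 0xFF)]);
    println!("unicode class empty: {}", class.is_empty());
    println!("matches '≥': {}", contains(&class, '≥'));
    // After the conversion nothing is left, which is the case the
    // added `is_empty` check is meant to catch.
    println!("byte class empty: {}", to_byte_ranges(&class).is_empty());
}
```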
Andrew Gallant
7046d65d3d Fix bug with ^/$ handling in invalid UTF-8.
If a haystack was invalid UTF-8 (which is allowed to be searched using a
`bytes::Regex`), then ^/$ handling was incorrect. Namely, the ^/$
handling assumed that failing to decode a codepoint from the haystack
meant that the position was either at the beginning or end of the string
(which is true if the haystack is guaranteed to be valid UTF-8).
Instead, we should query the position directly rather than relying on the
encoding properties of the haystack.

Fixes #277.
2016-09-04 10:08:32 -04:00
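The faulty assumption fixed in 7046d65d3d can be illustrated with a small stdlib-only sketch. The functions `decode_at`, `is_end_by_decoding`, and `is_end_by_position` are hypothetical and only model the idea for `$`. The real code lives inside the regex engine and is not reproduced here.

```rust
/// Try to decode one UTF-8 encoded char starting at byte offset `at`.
/// Returns None at the end of the haystack or on invalid UTF-8.
fn decode_at(haystack: &[u8], at: usize) -> Option<char> {
    (1..=4usize)
        .filter(|&len| at + len <= haystack.len())
        .find_map(|len| std::str::from_utf8(&haystack[at..at + len]).ok())
        .and_then(|s| s.chars().next())
}

/// Buggy model: treats "could not decode a char" as "end of haystack".
/// That is only sound when the haystack is guaranteed valid UTF-8.
fn is_end_by_decoding(haystack: &[u8], at: usize) -> bool {
    decode_at(haystack, at).is_none()
}

/// Fixed model: query the position directly.
fn is_end_by_position(haystack: &[u8], at: usize) -> bool {
    at == haystack.len()
}

fn main() {
    // 0xFF can never appear in valid UTF-8.
    let haystack = b"a\xFFb";
    // Offset 1 is in the middle of the haystack, yet decoding fails there.
    println!("by decoding: {}", is_end_by_decoding(haystack, 1));
    println!("by position: {}", is_end_by_position(haystack, 1));
}
```

On valid UTF-8 the two checks agree at every char boundary, which is why the bug only showed up with `bytes::Regex`.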
Andrew Gallant
d98ec1b1a5 Add regex matching for &[u8].
This commit enables support for compiling regular expressions that can
match on arbitrary byte slices. In particular, we add a new sub-module
called `bytes` that duplicates the API of the top-level module, except
`&str` for subjects is replaced by `&[u8]`. Additionally, Unicode
support in the regular expression is disabled by default but can be
selectively re-enabled with the `u` flag. (Unicode support cannot be
selectively disabled in the standard top-level API.)

Most of the interesting changes occurred in the `regex-syntax` crate,
where the AST now explicitly distinguishes between "ASCII compatible"
expressions and Unicode aware expressions.

This PR makes a few other changes out of convenience:

1. The DFA now knows how to "give up" if it's flushing its cache too
often. When the DFA gives up, either backtracking or the NFA algorithm
takes over, which provides better performance.
2. Benchmarks were added for Oniguruma.
3. The benchmarks in general were overhauled to be defined in one place
by using conditional compilation.
4. The tests have been completely reorganized to make it easier to split
up the tests depending on which regex engine we're using. For example,
we occasionally need to be able to write tests specifically for
`regex::Regex` or specifically for `regex::bytes::Regex`.
5. Fixes a bug where NUL bytes weren't represented correctly in the byte
class optimization for the DFA.

Closes #85.
2016-03-09 21:23:29 -05:00
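Item 1 of d98ec1b1a5 says the DFA can "give up" when it flushes its cache too often. The commit message does not state the exact heuristic, so the following is only a sketch of one plausible shape. The type `CacheStats`, the function `on_cache_full`, and the thresholds `MAX_FLUSHES` and `MIN_BYTES_PER_FLUSH` are all assumptions made for illustration, not values taken from the regex crate.

```rust
/// Hypothetical bookkeeping for a lazy DFA's state cache.
struct CacheStats {
    flush_count: u32,
    bytes_since_flush: usize,
}

/// Which engine should continue the search.
#[derive(Debug, PartialEq)]
enum Engine {
    Dfa,
    Fallback, // backtracking or the NFA algorithm
}

// Assumed thresholds, chosen arbitrarily for this sketch.
const MAX_FLUSHES: u32 = 3;
const MIN_BYTES_PER_FLUSH: usize = 1024;

/// Called when the cache is full and must be flushed. Gives up on the
/// DFA once flushes are frequent and little input was consumed between
/// them, since rebuilding states would then dominate the search time.
fn on_cache_full(stats: &mut CacheStats) -> Engine {
    stats.flush_count += 1;
    let thrashing = stats.flush_count >= MAX_FLUSHES
        && stats.bytes_since_flush < MIN_BYTES_PER_FLUSH;
    stats.bytes_since_flush = 0;
    if thrashing {
        Engine::Fallback
    } else {
        Engine::Dfa
    }
}

fn main() {
    let mut stats = CacheStats { flush_count: 0, bytes_since_flush: 0 };
    for _ in 0..3 {
        // Pretend only 10 bytes were scanned before the cache filled up.
        stats.bytes_since_flush = 10;
        println!("{:?}", on_cache_full(&mut stats));
    }
}
```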