third_party_rust_regex

openharmony/third_party_rust_regex

mirror of https://gitee.com/openharmony/third_party_rust_regex synced 2025-04-07 20:51:33 +00:00

Commit Graph

Author

SHA1

Message

Date

Scott Steele

b96e5cb899

Verify character class still non-empty after converting to byte class

For `[^\x00-\xff]`, while it is still treated as a full Unicode
character class, it is not empty. For instance `≥` would still be
matched.

However, when `CharClass::to_byte_class` is called on it (as is done
when using `regex::bytes::Regex::new` rather than `regex::Regex::new`),
it _is_ now empty, since it excludes all possible bytes.

This commit adds a test asserting that `regex::bytes::Regex::new`
returns `Err` for this case (in accordance with
https://github.com/rust-lang-nursery/regex/issues/106) and adds an
`is_empty` check to the result of calling `CharClass::to_byte_class`,
which allows the test to pass.
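
The effect described above can be sketched with a small stand-alone model. This is not the `regex-syntax` implementation: the `CharClass` struct, its range representation, and `compile_for_bytes` are hypothetical names chosen for illustration, and surrogate code points are ignored for brevity. The sketch only shows why a class that is non-empty over Unicode can become empty once restricted to single bytes, and where an `is_empty` check would reject it.

```rust
/// Hypothetical model of a character class: sorted, non-overlapping,
/// inclusive ranges of code points. Not the real `regex-syntax` type.
#[derive(Debug, Clone, PartialEq)]
struct CharClass {
    ranges: Vec<(u32, u32)>,
}

const MAX_CODEPOINT: u32 = 0x10FFFF;

impl CharClass {
    /// Complement of the class over 0..=MAX_CODEPOINT.
    fn negate(&self) -> CharClass {
        let mut out = Vec::new();
        let mut next = 0u32;
        for &(lo, hi) in &self.ranges {
            if lo > next {
                out.push((next, lo - 1));
            }
            next = hi + 1;
        }
        if next <= MAX_CODEPOINT {
            out.push((next, MAX_CODEPOINT));
        }
        CharClass { ranges: out }
    }

    fn contains(&self, c: char) -> bool {
        let cp = c as u32;
        self.ranges.iter().any(|&(lo, hi)| lo <= cp && cp <= hi)
    }

    fn is_empty(&self) -> bool {
        self.ranges.is_empty()
    }

    /// Keep only the part of each range that fits in a single byte.
    fn to_byte_class(&self) -> Vec<(u8, u8)> {
        self.ranges
            .iter()
            .filter(|&&(lo, _)| lo <= 0xFF)
            .map(|&(lo, hi)| (lo as u8, hi.min(0xFF) as u8))
            .collect()
    }
}

/// Assumed shape of the added check: reject a class that becomes
/// empty after conversion to bytes.
fn compile_for_bytes(class: &CharClass) -> Result<Vec<(u8, u8)>, String> {
    let bytes = class.to_byte_class();
    if bytes.is_empty() {
        Err("empty character class after byte conversion".to_string())
    } else {
        Ok(bytes)
    }
}
```

In this model, negating the range `0x00..=0xFF` yields a class that still contains `≥`, yet `compile_for_bytes` returns `Err` for it because no byte range survives the conversion.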

2016-12-07 21:20:08 -05:00

Andrew Gallant

7046d65d3d

Fix bug with ^/$ handling in invalid UTF-8.

If a haystack was invalid UTF-8 (which is allowed to be searched using a
`bytes::Regex`), then ^/$ handling was incorrect. Namely, the ^/$
handling assumed that failing to decode a codepoint from the haystack
meant that the position was either at the beginning or end of the string
(which is true if the haystack is guaranteed to be valid UTF-8).
Instead, we should query the position directly rather than relying on
the encoding properties of the haystack.

Fixes #277.
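
The difference between the two approaches can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the crate's source; the sketch assumes only what the message states, namely that the old logic treated a failed decode as "at a boundary" and the fix compares the position against the haystack bounds.

```rust
/// Decode the UTF-8 scalar value that ends exactly at the end of `bytes`.
/// Returns `None` if `bytes` is empty or no valid encoding ends there.
fn decode_last(bytes: &[u8]) -> Option<char> {
    // A scalar value occupies 1 to 4 bytes, so try each suffix length.
    for len in 1..=bytes.len().min(4) {
        let start = bytes.len() - len;
        if let Ok(s) = std::str::from_utf8(&bytes[start..]) {
            return s.chars().next_back();
        }
    }
    None
}

/// Old assumption (incorrect for invalid UTF-8): if nothing can be
/// decoded before `at`, the position must be the start of the haystack.
fn is_start_by_decoding(haystack: &[u8], at: usize) -> bool {
    decode_last(&haystack[..at]).is_none()
}

/// Position-based checks: independent of how the haystack is encoded.
fn is_start_by_position(_haystack: &[u8], at: usize) -> bool {
    at == 0
}

fn is_end_by_position(haystack: &[u8], at: usize) -> bool {
    at == haystack.len()
}
```

For the haystack `b"\xFFabc"`, the decode-based check reports position 1 as the start of the text because `\xFF` cannot be decoded, while the position-based check correctly reports that only position 0 is the start.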

2016-09-04 10:08:32 -04:00

Andrew Gallant

d98ec1b1a5

Add regex matching for &[u8].

This commit enables support for compiling regular expressions that can
match on arbitrary byte slices. In particular, we add a new sub-module
called `bytes` that duplicates the API of the top-level module, except
`&str` for subjects is replaced by `&[u8]`. Additionally, Unicode
support in the regular expression is disabled by default but can be
selectively re-enabled with the `u` flag. (Unicode support cannot be
selectively disabled in the standard top-level API.)
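
As a rough illustration of why a `&[u8]` API is useful, the standard-library-only sketch below searches a byte slice that cannot be represented as `&str`. It does not use the `regex` crate; `find_bytes` is a hypothetical helper that performs a naive literal search and stands in for a real pattern match.

```rust
/// Naive literal search over raw bytes; returns the offset of the first
/// occurrence of `needle` in `haystack`, if any.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // `windows(0)` would panic, and an empty needle matches at 0.
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// True if the bytes could be handed to a `&str`-based API at all.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

With the haystack `b"key=\xFFvalue"`, `is_valid_utf8` returns `false`, so a `&str`-based search cannot be applied without lossy conversion, while `find_bytes` still locates `b"value"` at offset 5.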

Most of the interesting changes occurred in the `regex-syntax` crate,
where the AST now explicitly distinguishes between "ASCII compatible"
expressions and Unicode aware expressions.

This PR makes a few other changes out of convenience:

1. The DFA now knows how to "give up" if it's flushing its cache too
often. When the DFA gives up, either backtracking or the NFA algorithm
takes over, which provides better performance.
2. Benchmarks were added for Oniguruma.
3. The benchmarks in general were overhauled to be defined in one place
by using conditional compilation.
4. The tests have been completely reorganized to make it easier to split
up the tests depending on which regex engine we're using. For example,
we occasionally need to be able to write tests specifically for
`regex::Regex` or specifically for `regex::bytes::Regex`.
5. Fixes a bug where NUL bytes weren't represented correctly in the byte
class optimization for the DFA.

Closes #85.
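
The "give up" behaviour in item 1 can be pictured with a toy model. Everything here is an assumption made for illustration: the `StateCache` type, the fixed `max_flushes` threshold, and the `Engine` enum are hypothetical, and the crate's actual heuristic may weigh other factors. The sketch only shows the general idea of counting cache flushes and switching engines once a limit is exceeded.

```rust
/// Which engine ends up handling the search in this toy model.
#[derive(Debug, PartialEq)]
enum Engine {
    Dfa,
    Fallback, // stands in for backtracking or the NFA algorithm
}

/// Hypothetical bounded cache of lazily built DFA states.
struct StateCache {
    states: Vec<u32>, // stand-in for compiled states
    capacity: usize,
    flush_count: u32,
    max_flushes: u32, // assumed fixed threshold
}

impl StateCache {
    fn new(capacity: usize, max_flushes: u32) -> Self {
        StateCache { states: Vec::new(), capacity, flush_count: 0, max_flushes }
    }

    /// Add a state, flushing the cache when it is full.
    /// Returns `false` once the cache has been flushed too often.
    fn add_state(&mut self, state: u32) -> bool {
        if self.states.len() == self.capacity {
            self.states.clear();
            self.flush_count += 1;
            if self.flush_count > self.max_flushes {
                return false; // give up on the DFA
            }
        }
        self.states.push(state);
        true
    }
}

/// Feed `new_states` distinct states through the cache and report
/// which engine would finish the search.
fn run(cache: &mut StateCache, new_states: u32) -> Engine {
    for state in 0..new_states {
        if !cache.add_state(state) {
            return Engine::Fallback;
        }
    }
    Engine::Dfa
}
```

With a capacity of 2 and `max_flushes` of 1, four new states fit with a single flush and the DFA keeps running, while a fifth new state triggers a second flush and the model switches to the fallback engine.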

2016-03-09 21:23:29 -05:00

3 Commits