In commit 56ea4a, character classes were changed so that case folding a
class stores all possible variants in the class's ranges. This makes it possible
to drastically simplify the compiler to the point where case folding flags
can be completely removed. This has two major implications for
performance:
1. Matching engines no longer need to do case folding on the input.
2. Since case folding is now part of the automata, literal prefix
optimizations are now automatically applied even to regexes with
(?i).
This makes several changes in the public API of regex-syntax. Namely,
the `casei` flag has been removed from the `CharClass` expression and
the corresponding `is_case_insensitive` method has been removed.
TL;DR - The combination of case folding, character classes and nested
negation is darn tricky.
The problem presented in #99 was related to how we're storing case folded
character classes. Namely, we only store the canonical representation
of each character (which means that when we match text, we must apply
case folding to the input). But when this representation is negated,
information is lost.
From #99, consider a class containing the single range `x` that is
negated before case folding is applied. The negated class includes `X`,
so case folding it produces a class that includes both `X` and `x`, even
though the regex in #99 is specifically trying to match neither `X` nor `x`.
The solution is to apply case folding *after* negation. But given our
representation, this doesn't work. Namely, case folding the range `x`
yields `x` with a case insensitive flag set. Negating this class ends up
matching all characters sans `x`, which means it will match `X`.
So I've backtracked the representation to include *all* case folding
variants. This means we can negate case folded classes and get the
expected result. e.g., case folding the class `[x]` yields `[xX]`, and
negating `[xX]` gives the desired result for the regex in #99.
This commit introduces a new `regex-syntax` crate that provides a
regular expression parser and an abstract syntax for regular
expressions. As part of this effort, the parser has been rewritten and
has grown a substantial number of tests.
The `regex` crate itself hasn't changed too much. I opted for the
smallest possible delta to get it working with the new regex AST.
In most cases, this simplified code because it no longer has to deal
with unwieldy flags. (Instead, flag information is baked into the AST.)
Here is a list of public facing non-breaking changes:
* A new `regex-syntax` crate with a parser, regex AST and lots of tests.
  This closes #29 and fixes #84.
* A new flag, `x`, has been added. This allows one to write regexes with
insignificant whitespace and comments.
* Repetition operators can now be directly applied to zero-width
matches. e.g., `\b+` was previously not allowed but now works.
Note that one could always write `(\b)+` previously. This change
is mostly about lifting an arbitrary restriction.
And a list of breaking changes:
* A new `Regex::with_size_limit` constructor function that allows one
  to tweak the limit on the size of a compiled regex. This fixes #67.
The new method isn't a breaking change, but regexes that exceed the
size limit (set to 10MB by default) will no longer compile. To fix,
simply call `Regex::with_size_limit` with a bigger limit.
* Capture group names cannot start with a number. This is a breaking
change because regexes that previously compiled (e.g., `(?P<1a>.)`)
will now return an error. This fixes#69.
* The `regex::Error` type has been changed to reflect the better error
reporting in the `regex-syntax` crate, and a new error for limiting
regexes to a certain size. This is a breaking change. Most folks just
call `unwrap()` on `Regex::new`, so I expect this to have minimal
impact.
Closes #29, #67, #69, #79, #84.
[breaking-change]
There was an easy opportunity to better optimize the tables generated
by unicode.py. Not sure why I didn't catch this long ago, but in any
case, the tables are now substantially smaller and may improve
performance slightly.
There was also some dead code sitting in unicode.py that I pulled out.
This commit pulls in the script used to generate tables.rs from the main
distribution and strips it down to just the bare bones necessary for regexes
(which is still quite a lot!). The script was used to generate a `unicode.rs`
file which contains all the data needed from the libunicode crate.
Eventually we hope to provide libunicode in some form on crates.io or perhaps
stabilize it in the distribution itself, but for now it's not so bad to vendor
the dependency (which doesn't change much) and it's required to get libregex
building on stable Rust.
Fixes #31 and #33.
There are a number of related changes in this commit:
1. A script that generates the 'match' tests has been reintroduced.
2. The regex-dna shootout benchmark has been updated.
3. Running `cargo test` on the `regex` crate does not require
`regex_macros`.
4. The documentation has been updated to use `Regex::new(...).unwrap()`
instead of `regex!`. The emphasis on using `regex!` has been reduced,
and a note about its unavailability in Rust 1.0 beta/stable has been
added.
5. Travis has been updated to test both `regex` and `regex_macros`.