The regex compiler will happily attempt to compile '(?:){294967295}' by
compiling the empty sub-expression 294,967,295 times. Empty
sub-expressions don't use any memory in the current implementation, so
this doesn't trigger the pre-existing machinery for stopping compilation
early if the regex object gets too big. The end result is that while
compilation will eventually succeed, it takes a very long time to do so.
In this commit, we fix this problem by adding a fake amount of memory
every time we compile an empty sub-expression. It turns out we were
already tracking an additional amount of indirect heap usage via
'extra_inst_bytes' in the compiler, so we just make it look like
compiling an empty sub-expression actually adds an additional 'Inst' to
the compiled regex object.
This has the effect of causing the regex compiler to reject this sort of
regex in a reasonable amount of time by default.
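As a rough illustration of the new behavior (a sketch, not the crate's actual
test suite), compiling the pathological pattern with the default size limit
should now fail quickly rather than spin in the compiler:

```
// Sketch: under the default size_limit, the huge empty repetition is
// expected to be rejected (with a "compiled regex too big" style error)
// instead of taking a very long time to compile.
fn main() {
    let result = regex::Regex::new("(?:){294967295}");
    assert!(result.is_err());
}
```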
Many thanks to @VTCAKAVSMoACE for reporting this, providing valuable test
cases and continuing to test this patch as it was developed.
Fixes https://github.com/rust-lang/regex/security/advisories/GHSA-m5pq-gvj9-9vr8
This commit fixes a fairly large regression in the stack size of a Regex
introduced in regex 1.4.4. When I dropped thread_local and replaced it
with Pool, it turned out that Pool inlined a T into its struct and a
Regex in turn had Pool inlined into itself. It further turns out that
the T=ProgramCache is itself quite large.
We fix this by introducing an indirection in the inner regex type. That
is, we use a Box<Pool> instead of a Pool. This shrinks the size of a
Regex from 856 bytes to 16 bytes.
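A minimal, self-contained sketch of the idea; the type names below are
illustrative stand-ins, not the crate's actual internals:

```
// BigCache, Pool, RegexBefore and RegexAfter are hypothetical names used
// only to illustrate the indirection.
struct BigCache {
    _scratch: [u64; 100], // stands in for the large ProgramCache scratch space
}

struct Pool<T> {
    _owner_val: T, // Pool inlines a T, which is what bloated Regex in 1.4.4
}

struct RegexBefore {
    _pool: Pool<BigCache>, // the whole cache is inlined into the regex value
}

struct RegexAfter {
    _pool: Box<Pool<BigCache>>, // a single pointer instead
}

fn main() {
    println!("before: {} bytes", std::mem::size_of::<RegexBefore>());
    println!("after:  {} bytes", std::mem::size_of::<RegexAfter>());
}
```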
Interestingly, prior to regex 1.4.4, a Regex was still quite substantial
in size, coming in at around 552 bytes. So it looks like the 1.4.4
release didn't dramatically increase it, but it increased it enough that
folks started experiencing real problems: stack overflows.
Since indirection can lead to worse locality and performance loss, I did
run the benchmark suite. I couldn't see any measurable difference. This
is generally what I would expect. This is an indirection at a fairly
high level. There's lots of other indirection already, and this
indirection isn't accessed in a hot path. (The regex cache itself is of
course used in hot paths, but by the time we get there, we have already
followed this particular pointer.)
We also include a regression test that asserts a Regex (and company) are
16 bytes in size. While this isn't an API guarantee, it at least means
that increasing the size of Regex will be an intentional thing in the
future and not an accidental leakage of implementation details.
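A sketch of what such a regression test can look like (the actual test in the
crate may check more types and be structured differently):

```
// Assumes a 64-bit target; the set of types checked here is illustrative.
#[test]
fn regex_is_small() {
    use std::mem::size_of;
    assert_eq!(16, size_of::<regex::Regex>());
    assert_eq!(16, size_of::<regex::bytes::Regex>());
}
```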
Fixes #750, Fixes #751
Ref https://github.com/servo/servo/pull/28269
This commit removes the thread_local dependency (even as an optional
dependency) and replaces it with a more purpose-driven memory pool. The
comments in src/pool.rs explain this in more detail, but the short story
is that thread_local seems to be at the root of some memory leaks
happening in certain usage scenarios.
The great thing about thread_local though is how fast it is. Using a
simple Mutex<Vec<T>> is easily at least twice as slow. We work around
that a bit by coding a simplistic fast path for the "owner" of a pool.
This does require one new use of `unsafe`, which we document
extensively.
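A highly simplified sketch of the owner fast path idea. This is not the
actual implementation in src/pool.rs, which avoids locking on the fast path
via `unsafe`; here a Mutex stands in to keep the sketch safe:

```
use std::sync::Mutex;
use std::thread::{self, ThreadId};

// A toy pool: the creating ("owner") thread gets a dedicated slot, while
// other threads share a Mutex-protected stack of values.
struct Pool<T> {
    owner: ThreadId,
    owner_val: Mutex<Option<T>>, // the real code avoids this lock with unsafe
    stack: Mutex<Vec<T>>,
    create: fn() -> T,
}

impl<T> Pool<T> {
    fn new(create: fn() -> T) -> Pool<T> {
        Pool {
            owner: thread::current().id(),
            owner_val: Mutex::new(Some(create())),
            stack: Mutex::new(vec![]),
            create,
        }
    }

    fn get(&self) -> T {
        if thread::current().id() == self.owner {
            // Fast path: reuse the owner's dedicated value if it's available.
            if let Some(val) = self.owner_val.lock().unwrap().take() {
                return val;
            }
        }
        // Slow path: pop a pooled value or create a fresh one.
        self.stack.lock().unwrap().pop().unwrap_or_else(|| (self.create)())
    }

    fn put(&self, val: T) {
        if thread::current().id() == self.owner {
            let mut slot = self.owner_val.lock().unwrap();
            if slot.is_none() {
                *slot = Some(val);
                return;
            }
        }
        self.stack.lock().unwrap().push(val);
    }
}
```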
This now makes the 'perf-cache' feature a no-op. We of course retain it
for compatibility purposes (and perhaps it will be used again in the
future), but for now, we always use the same pool.
As for benchmarks, it is likely that *some* cases will get a hair
slower. But there shouldn't be any dramatic difference. A careful review
of micro-benchmarks in addition to more holistic (albeit ad hoc)
benchmarks via ripgrep seems to confirm this.
Now that we have more explicit control over the memory pool, we also
clean stuff up with respect to RefUnwindSafe.
Fixes #362, Fixes #576
Ref https://github.com/BurntSushi/rure-go/issues/3
This commit sets up the infrastructure for supporting various `unicode`
and `perf` features, which permit decreasing binary size, compile times
and the size of the dependency tree.
Most of the work here is in modifying the regex tests to make them
work in concert with the available Unicode features. In cases where
Unicode is irrelevant, we just turn it off. In other cases, we require
the Unicode features to run the tests.
This also introduces a new compiler error: if a Unicode word boundary is
used but the `unicode-perl` feature is disabled, then the regex will fail
to compile, because the necessary data to match Unicode word boundaries
isn't available.
The problem with putting it in the regex crate proper is that it
requires the regex crate to bump its minimal regex-syntax crate version.
While this isn't necessarily an issue, since we can't enable Cargo's
minimal version check because of the `rand` dependency, this winds up
being a hazard. Plus, having it in the regex crate doesn't buy us too
much. It's just as well to have the tests in regex-syntax.
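To illustrate the new compiler error described above, here is a rough sketch,
assuming a build of the regex crate without the `unicode-perl` feature:

```
fn main() {
    // Expected to fail: matching a Unicode word boundary needs data that
    // only ships with the `unicode-perl` feature.
    assert!(regex::Regex::new(r"\bfoo\b").is_err());
    // Still fine: ASCII word boundaries need no Unicode tables.
    assert!(regex::Regex::new(r"(?-u:\b)foo(?-u:\b)").is_ok());
}
```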
Fixes #593
This commit fixes a regression introduced in 1.1.3 where Regex was no
longer UnwindSafe. The underlying reason is that the new AhoCorasick
type in aho-corasick 0.7 was not UnwindSafe. This has been fixed in
aho-corasick 0.7.4, so all we need to do to fix it is to increase the
minimum aho-corasick version, which we do here.
We also add an oibits test that ensures this particular regression can't
happen again. (Along with testing Send and Sync, which surprisingly did
not seem to have tests before this.)
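A sketch of the kind of auto-trait ("oibit") assertion described here; the
real test may be structured differently:

```
use std::panic::{RefUnwindSafe, UnwindSafe};

// Fails to compile if Regex ever stops implementing one of these traits.
fn assert_oibits<T: Send + Sync + UnwindSafe + RefUnwindSafe>() {}

fn main() {
    assert_oibits::<regex::Regex>();
}
```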
Fixes #568
This commit disables octal syntax by default, which permits us to
produce useful error messages if a user tries to use a backreference.
This commit adds a new `octal` method to RegexBuilder and RegexSetBuilder
which permits callers to re-enable octal syntax.
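A brief sketch of the new knob, using the current crate API:

```
use regex::{Regex, RegexBuilder};

fn main() {
    // By default, `\1` is rejected, which leaves room for an error message
    // explaining that backreferences are not supported.
    assert!(Regex::new(r"\1").is_err());

    // Opting back into octal syntax makes `\1` mean the codepoint U+0001.
    let re = RegexBuilder::new(r"\1").octal(true).build().unwrap();
    assert!(re.is_match("\u{1}"));
}
```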
See #457
The lazy DFA now supports ASCII word boundaries. In other words, `\b` in
a `bytes::Regex` can now be used in the DFA.
This leads to a big performance boost:
```
sherlock::word_ending_n 115,465,261 (5 MB/s) 3,038,621 (195 MB/s) -112,426,640 -97.37%
```
Unfortunately, Unicode word boundaries continue to elude the DFA. This
state of affairs is lamentable, but after a lot of thought, I've
concluded there are only two ways to speed up Unicode word boundaries:
1. Come up with a harebrained scheme to add multi-byte look-behind/ahead
to the lazy DFA. (The theory says it's possible. Figuring out how to
do this without combinatorial state explosion is not within my grasp
at the moment.)
2. Build a second lazy DFA with transitions on Unicode codepoints
instead of bytes. (The looming inevitability of this makes me queasy
for a number of reasons.)
To ameliorate this state of affairs, it is now possible to disable
Unicode support in `Regex::new` with `(?-u)`. In other words, one can
now use an ASCII word boundary with `(?-u:\b)`.
Disabling Unicode support does not violate any invariants around UTF-8.
In particular, if the regular expression could lead to a match of
invalid UTF-8, then the parser will return an error. (This only happens
for `Regex::new`. `bytes::Regex::new` still of course allows matching
arbitrary bytes.)
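A short sketch of the escape hatch, using the current crate API:

```
fn main() {
    // An ASCII-only word boundary inside an otherwise Unicode-aware pattern.
    let re = regex::Regex::new(r"(?-u:\b)\w+(?-u:\b)").unwrap();
    assert!(re.is_match("hello world"));

    // A pattern that could match invalid UTF-8 is still rejected here
    // (bytes::Regex::new would accept it).
    assert!(regex::Regex::new(r"(?-u:\xFF)").is_err());
}
```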
Finally, a new `PERFORMANCE.md` guide was written.
The principal change in this commit is a complete rewrite of how
literals are detected from a regular expression. In particular, we now
traverse the abstract syntax to discover literals instead of the
compiled byte code. This permits more tuneable control over which and
how many literals are extracted, and is now exposed in the
`regex-syntax` crate so that others can benefit from it.
Other changes in this commit:
* The Boyer-Moore algorithm was rewritten to use my own concoction based
on frequency analysis. We end up regressing on a couple benchmarks
slightly because of this, but gain in some others and in general should
be faster in a broader number of cases. (Principally because we try to
run `memchr` on the rarest byte in a literal.) This should also greatly
improve handling of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix
literals exist but no prefix literals exist, then we can quickly scan
for suffix matches and then run the DFA in reverse to find matches.
(I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool
(from the new `mempool` crate). This reduces some amount of constant
overhead and improves several benchmarks that either search short
haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it
could only contain 2 or more); see the sketch after this list. The
InvalidSet error variant is now deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the
start state was always the first instruction, which is trivially
wrong for an expression like `^☃$`. This bug persisted because it
typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published
sub-crate. The CLI tool can answer various facts about regular
expressions, such as printing its AST, its compiled byte code or its
detected literals.
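A short sketch of the relaxed RegexSet constructor mentioned in the list
above, using the current crate API:

```
use regex::RegexSet;

fn main() {
    // An empty set is now valid; it simply never matches anything.
    let empty = RegexSet::new(&[] as &[&str]).unwrap();
    assert_eq!(0, empty.len());
    assert!(!empty.is_match("anything"));

    // As is a set with a single regex.
    let one = RegexSet::new(&["^foo"]).unwrap();
    assert!(one.is_match("foobar"));
}
```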
Closes #96, #188, #189
This commit enables support for compiling regular expressions that can
match on arbitrary byte slices. In particular, we add a new sub-module
called `bytes` that duplicates the API of the top-level module, except
`&str` for subjects is replaced by `&[u8]`. Additionally, Unicode
support in the regular expression is disabled by default but can be
selectively re-enabled with the `u` flag. (Unicode support cannot be
selectively disabled in the standard top-level API.)
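A brief sketch of the `bytes` API, shown with the current crate, where the
search methods take `&[u8]`:

```
fn main() {
    // The haystack need not be valid UTF-8.
    let haystack: &[u8] = b"\xFFfoo bar\xFF";
    // The `u` flag selectively re-enables Unicode for part of the pattern.
    let re = regex::bytes::Regex::new(r"foo (?u:\w+)").unwrap();
    assert!(re.is_match(haystack));
}
```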
Most of the interesting changes occurred in the `regex-syntax` crate,
where the AST now explicitly distinguishes between "ASCII compatible"
expressions and Unicode aware expressions.
This PR makes a few other changes out of convenience:
1. The DFA now knows how to "give up" if it's flushing its cache too
often. When the DFA gives up, either backtracking or the NFA algorithm
takes over, which provides better performance.
2. Benchmarks were added for Oniguruma.
3. The benchmarks in general were overhauled to be defined in one place
by using conditional compilation.
4. The tests have been completely reorganized to make it easier to split
up the tests depending on which regex engine we're using. For example,
we occasionally need to be able to write tests specifically for
`regex::Regex` or specifically for `regex::bytes::Regex`.
5. Fixes a bug where NUL bytes weren't represented correctly in the byte
class optimization for the DFA.
Closes #85.