The regex compiler will happily attempt to compile '(?:){294967295}' by
compiling the empty sub-expression 294,967,295 times. Empty
sub-expressions don't use any memory in the current implementation, so
this doesn't trigger the pre-existing machinery for stopping compilation
early if the regex object gets too big. The end result is that while
compilation will eventually succeed, it takes a very long time to do so.
In this commit, we fix this problem by adding a fake amount of memory
every time we compile an empty sub-expression. It turns out we were
already tracking an additional amount of indirect heap usage via
'extra_inst_bytes' in the compiler, so we just make it look like
compiling an empty sub-expression actually adds an additional 'Inst' to
the compiled regex object.
This has the effect of causing the regex compiler to reject this sort of
regex in a reasonable amount of time by default.
Many thanks to @VTCAKAVSMoACE for reporting this, providing valuable
test cases and continuing to test this patch as it was developed.
Fixes https://github.com/rust-lang/regex/security/advisories/GHSA-m5pq-gvj9-9vr8
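The accounting trick can be sketched as follows. This is a simplified
model, not the actual compiler: the `Compiler` struct, the `size_limit`
value and the `Inst` layout here are illustrative stand-ins for the
real internals, with only the name `extra_inst_bytes` taken from the
commit message above.

```rust
use std::mem::size_of;

// Stand-in for a compiled instruction; only its size matters here.
struct Inst([usize; 4]);

struct Compiler {
    size_limit: usize,
    insts: Vec<Inst>,
    // Indirect heap usage not reflected in `insts` itself.
    extra_inst_bytes: usize,
}

impl Compiler {
    fn check_size(&self) -> Result<(), &'static str> {
        let used = self.insts.len() * size_of::<Inst>() + self.extra_inst_bytes;
        if used > self.size_limit {
            Err("compiled regex exceeds size limit")
        } else {
            Ok(())
        }
    }

    // Compiling an empty sub-expression emits no instructions, so we
    // pretend it added one `Inst` worth of memory anyway. This makes
    // pathological repetitions like '(?:){294967295}' trip the size
    // limit quickly instead of churning for a very long time.
    fn compile_empty(&mut self) -> Result<(), &'static str> {
        self.extra_inst_bytes += size_of::<Inst>();
        self.check_size()
    }
}

fn main() {
    let mut c = Compiler {
        size_limit: 10 * (1 << 20), // a 10 MB limit, for illustration
        insts: Vec::new(),
        extra_inst_bytes: 0,
    };
    // Simulate compiling the empty sub-expression 294,967,295 times;
    // the size check trips long before the loop finishes.
    let mut result = Ok(());
    for _ in 0..294_967_295u64 {
        result = c.compile_empty();
        if result.is_err() {
            break;
        }
    }
    assert!(result.is_err());
    println!("rejected: {:?}", result);
}
```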
This fixes compilation when the 'pattern' feature
is enabled. This wasn't previously tested since it
is a nightly only feature. But in this commit, we
add it to CI explicitly.
PR #772
When only the unicode-perl feature is enabled, regex-syntax would fail
to build. It turns out that 'cargo fix' doesn't actually fix all
imports. It looks like it only fixes things that it can build in the
current configuration.
Fixes #769, Fixes #770
One of the things the lazy DFA can't handle is Unicode word boundaries,
since they require multi-byte look-around. However, it turns out that
on pure ASCII text, Unicode word boundaries are equivalent to ASCII
word boundaries. So the DFA has a heuristic: it treats Unicode word
boundaries as ASCII word boundaries until it sees a non-ASCII byte.
When it does, it quits, and some other (slower) regex engine needs to
take over.
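A toy model of that heuristic (purely illustrative, not the real DFA;
`count_word_boundaries` is a hypothetical function):

```rust
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Counts \b matches using only the cheap ASCII word-boundary test.
/// On pure ASCII input this agrees exactly with the Unicode
/// definition. The moment a non-ASCII byte shows up, the heuristic no
/// longer applies and we return None, modeling the DFA quitting and
/// handing the haystack to a slower engine.
fn count_word_boundaries(haystack: &[u8]) -> Option<usize> {
    let mut count = 0;
    let mut prev_is_word = false;
    for &b in haystack {
        if b >= 0x80 {
            // Non-ASCII byte: give up.
            return None;
        }
        let cur_is_word = is_ascii_word_byte(b);
        if cur_is_word != prev_is_word {
            count += 1;
        }
        prev_is_word = cur_is_word;
    }
    if prev_is_word {
        count += 1; // boundary at the end of the haystack
    }
    Some(count)
}

fn main() {
    // "abc def" has word boundaries at offsets 0, 3, 4 and 7.
    assert_eq!(count_word_boundaries(b"abc def"), Some(4));
    // Any non-ASCII byte forces a fallback.
    assert_eq!(count_word_boundaries("café".as_bytes()), None);
    println!("ok");
}
```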
In a bug report against ripgrep[1], it was discovered that the lazy DFA
was quitting and falling back to a slower engine even though the
haystack was pure ASCII.
It turned out that our equivalence byte class optimization was at fault.
Namely, a '{' (which appears very frequently in the input) was being
grouped in with other non-ASCII bytes. So whenever the DFA saw it, it
treated it as a non-ASCII byte and thus stopped.
The fix for this is simple: when we see a Unicode word boundary in the
compiler, we set a boundary on our byte classes such that ASCII bytes
are guaranteed to be in a different class from non-ASCII bytes. And
indeed, this fixes the performance problem reported in [1].
[1] - https://github.com/BurntSushi/ripgrep/issues/1860
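The byte-class fix can be sketched roughly like this. `ByteClassSet`
here is a simplified illustration of equivalence byte classes, not the
actual implementation:

```rust
/// Toy model of equivalence byte classes: bytes are partitioned into
/// classes by marking "boundaries"; a boundary at byte `b` means `b`
/// and `b + 1` land in different classes. The fix described above
/// amounts to always marking a boundary at 0x7F when a Unicode word
/// boundary appears, guaranteeing that every ASCII byte lands in a
/// different class from every non-ASCII byte.
struct ByteClassSet([bool; 256]);

impl ByteClassSet {
    fn new() -> ByteClassSet {
        ByteClassSet([false; 256])
    }

    fn set_boundary(&mut self, b: u8) {
        self.0[b as usize] = true;
    }

    /// Map each of the 256 bytes to its equivalence class ID.
    fn byte_classes(&self) -> [u8; 256] {
        let mut classes = [0u8; 256];
        let mut class = 0u8;
        for b in 0..256usize {
            classes[b] = class;
            if self.0[b] && b < 255 {
                class += 1;
            }
        }
        classes
    }
}

fn main() {
    let mut set = ByteClassSet::new();
    // The fix: force a class boundary between ASCII and non-ASCII.
    set.set_boundary(0x7F);
    let classes = set.byte_classes();
    // '{' now shares a class with other ASCII bytes...
    assert_eq!(classes[b'{' as usize], classes[b'a' as usize]);
    // ...but never with a non-ASCII byte, so the DFA keeps running.
    assert_ne!(classes[b'{' as usize], classes[0x80]);
    println!("ok");
}
```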
This was long overdue, and we were motivated by memchr's move to Rust
2018 in https://github.com/BurntSushi/memchr/pull/82.
Rust 1.41.1 was selected because it's the current version of Rust in
Debian Stable. It also feels old enough to assure wide support.
This commit does a number of manual fixups to the code after the
previous two commits were done via 'cargo fix' automatically.
Actually, this contains more 'cargo fix' annotations, since I had
forgotten to add 'edition = "2018"' to all sub-crates.
This removes the ad hoc FreqyPacked searcher and the implementation of
Boyer-Moore, and replaces it with a new implementation of memmem in the
memchr crate. (Introduced in memchr 2.4.) Since memchr 2.4 also moves to
Rust 2018, we'll do the same in subsequent commits. (Finally.)
The benchmarks look about as expected. Latency on some of the smaller
benchmarks has worsened slightly, by a nanosecond or two, and top
throughput speed has also decreased a bit, but other benchmarks
(especially ones with frequent literal matches) have improved
dramatically.
This improves the precision of the "expression too big" regex
compilation error. Previously, it was not considering the heap usage
from Unicode character classes.
It's possible this will make some regexes fail to compile that
previously compiled. However, this is a bug fix. If you do wind up
seeing this, feel free to file an issue, since it would be good to get
an idea of what kinds of regexes no longer compile but previously did.
This was found by OSS-fuzz:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=33579
By using a boxed slice instead of a vector, we can shrink the size of
the `Inst` structure by 8 bytes, going from 40 to 32 bytes on 64-bit
platforms.
PR #760
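The saving comes from the capacity field: a `Vec<T>` is three words
(pointer, length, capacity), while a `Box<[T]>` is a two-word fat
pointer (pointer, length). A quick demonstration; the struct names are
illustrative stand-ins, not the real `Inst` definition:

```rust
use std::mem::size_of;

// Holding ranges in a Vec costs three words...
struct VecVariant {
    _ranges: Vec<(char, char)>,
}

// ...while a boxed slice costs two, since a Box<[T]> is just a fat
// pointer and has no separate capacity to track.
struct BoxedVariant {
    _ranges: Box<[(char, char)]>,
}

fn main() {
    println!("Vec variant:   {} bytes", size_of::<VecVariant>());
    println!("Boxed variant: {} bytes", size_of::<BoxedVariant>());
    // The saving is exactly one word: the dropped capacity field.
    assert_eq!(
        size_of::<VecVariant>() - size_of::<BoxedVariant>(),
        size_of::<usize>()
    );
}
```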
This commit fixes a fairly large regression in the stack size of a Regex
introduced in regex 1.4.4. When I dropped thread_local and replaced it
with Pool, it turned out that Pool inlined a T into its struct and a
Regex in turn had Pool inlined into itself. It further turns out that
the T=ProgramCache is itself quite large.
We fix this by introducing an indirection in the inner regex type. That
is, we use a Box<Pool> instead of a Pool. This shrinks the size of a
Regex from 856 bytes to 16 bytes.
Interestingly, prior to regex 1.4.4, a Regex was still quite substantial
in size, coming in at around 552 bytes. So it looks like the 1.4.4
release didn't dramatically increase it, but it increased it enough that
folks started experiencing real problems: stack overflows.
Since indirection can lead to worse locality and performance loss, I did
run the benchmark suite. I couldn't see any measurable difference. This
is generally what I would expect. This is an indirection at a fairly
high level. There's lots of other indirection already, and this
indirection isn't accessed in a hot path. (The regex cache itself is of
course used in hot paths, but by the time we get there, we have already
followed this particular pointer.)
We also include a regression test that asserts a Regex (and company) are
16 bytes in size. While this isn't an API guarantee, it at least means
that increasing the size of Regex will be an intentional thing in the
future and not an accidental leakage of implementation details.
Fixes #750, Fixes #751
Ref https://github.com/servo/servo/pull/28269
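The effect of the indirection can be illustrated with a toy version of
that size check. The type names here are stand-ins, not the real
types: `BigCache` plays the role of the pool of `ProgramCache` values
that was inlined into the regex struct.

```rust
use std::mem::size_of;

// Stand-in for the large per-thread cache that was being inlined.
struct BigCache {
    _bytes: [u8; 840],
}

// Inlining the cache makes the outer struct huge...
struct RegexInline {
    _pool: BigCache,
    _other: usize,
}

// ...while a Box is a single thin pointer no matter how big its
// contents are, so the boxed version stays pointer-sized.
struct RegexBoxed {
    _pool: Box<BigCache>,
    _other: usize,
}

fn main() {
    println!("inline: {} bytes", size_of::<RegexInline>());
    println!("boxed:  {} bytes", size_of::<RegexBoxed>());
    assert!(size_of::<RegexInline>() > 800);
    // Two words total: the Box pointer plus the other field.
    assert_eq!(size_of::<RegexBoxed>(), 2 * size_of::<usize>());
}
```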
This commit removes the thread_local dependency (even as an optional
dependency) and replaces it with a more purpose driven memory pool. The
comments in src/pool.rs explain this in more detail, but the short story
is that thread_local seems to be at the root of some memory leaks
happening in certain usage scenarios.
The great thing about thread_local though is how fast it is. Using a
simple Mutex<Vec<T>> is easily at least twice as slow. We work around
that a bit by coding a simplistic fast path for the "owner" of a pool.
This does require one new use of `unsafe`, which we document
extensively.
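The slow path amounts to a mutex-protected free list, roughly like the
sketch below. This is illustrative only: the real pool layers the
owner-thread fast path (the part needing `unsafe`) on top of something
like this.

```rust
use std::sync::Mutex;

/// A minimal memory pool: a mutex-protected stack of values, with a
/// fresh value created whenever the stack is empty. Values handed
/// back via put() are reused by later get() calls.
struct Pool<T> {
    create: fn() -> T,
    stack: Mutex<Vec<T>>,
}

impl<T> Pool<T> {
    fn new(create: fn() -> T) -> Pool<T> {
        Pool { create, stack: Mutex::new(Vec::new()) }
    }

    fn get(&self) -> T {
        match self.stack.lock().unwrap().pop() {
            Some(v) => v,
            None => (self.create)(),
        }
    }

    fn put(&self, v: T) {
        self.stack.lock().unwrap().push(v);
    }
}

fn main() {
    // A pool of scratch buffers (a stand-in for the real per-thread
    // regex caches).
    let pool = Pool::new(|| Vec::<u8>::with_capacity(1024));
    let mut buf = pool.get();
    buf.push(b'x');
    buf.clear();
    pool.put(buf);
    assert_eq!(pool.stack.lock().unwrap().len(), 1);
    println!("ok");
}
```

The contention on the mutex is what makes this roughly twice as slow
as thread_local in the uncontended case, hence the owner fast path.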
This now makes the 'perf-cache' feature a no-op. We of course retain it
for compatibility purposes (and perhaps it will be used again in the
future), but for now, we always use the same pool.
As for benchmarks, it is likely that *some* cases will get a hair
slower. But there shouldn't be any dramatic difference. A careful review
of micro-benchmarks in addition to more holistic (albeit ad hoc)
benchmarks via ripgrep seems to confirm this.
Now that we have more explicit control over the memory pool, we also
clean stuff up with respect to RefUnwindSafe.
Fixes #362, Fixes #576
Ref https://github.com/BurntSushi/rure-go/issues/3