1181 Commits

Author SHA1 Message Date
Andrew Gallant
258bdf798a
changelog: 1.5.5
This adds the notes after the release, which were overlooked.
2022-03-08 09:46:00 -05:00
Andrew Gallant
d130381b15
1.5.5 2022-03-08 08:58:47 -05:00
Andrew Gallant
ae70b41d4f
security: fix denial-of-service bug in compiler
The regex compiler will happily attempt to compile '(?:){294967295}' by
compiling the empty sub-expression 294,967,295 times. Empty
sub-expressions don't use any memory in the current implementation, so
this doesn't trigger the pre-existing machinery for stopping compilation
early if the regex object gets too big. The end result is that while
compilation will eventually succeed, it takes a very long time to do so.

In this commit, we fix this problem by adding a fake amount of memory
every time we compile an empty sub-expression. It turns out we were
already tracking an additional amount of indirect heap usage via
'extra_inst_bytes' in the compiler, so we just make it look like
compiling an empty sub-expression actually adds an additional 'Inst' to
the compiled regex object.

This has the effect of causing the regex compiler to reject this sort of
regex in a reasonable amount of time by default.
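
The accounting trick described above can be sketched roughly as follows. This is a toy model, not the regex crate's actual compiler: `Expr`, `Compiler`, `INST_BYTES` and `compile_with_limit` are hypothetical names made up for illustration, and only `extra_inst_bytes` is borrowed from the commit message.

```rust
/// A toy expression tree with just enough shape to show the idea.
/// (Hypothetical types; not the regex crate's real AST.)
pub enum Expr {
    Empty,
    Literal(char),
    Repeat { sub: Box<Expr>, count: u32 },
}

/// Assumed per-instruction cost in bytes, chosen only for this sketch.
const INST_BYTES: usize = 32;

pub struct Compiler {
    size_limit: usize,
    insts: usize,            // number of real instructions emitted
    extra_inst_bytes: usize, // usage that `insts` alone does not capture
}

impl Compiler {
    pub fn new(size_limit: usize) -> Self {
        Compiler { size_limit, insts: 0, extra_inst_bytes: 0 }
    }

    fn approx_size(&self) -> usize {
        self.insts * INST_BYTES + self.extra_inst_bytes
    }

    fn check_size(&self) -> Result<(), String> {
        if self.approx_size() > self.size_limit {
            Err(format!("compiled regex exceeds size limit of {} bytes", self.size_limit))
        } else {
            Ok(())
        }
    }

    pub fn compile(&mut self, expr: &Expr) -> Result<(), String> {
        match expr {
            // An empty sub-expression emits nothing, so charge a fake
            // instruction. Without this, repeating it would never grow
            // the tracked size and the limit could never trigger.
            Expr::Empty => self.extra_inst_bytes += INST_BYTES,
            Expr::Literal(_) => self.insts += 1,
            Expr::Repeat { sub, count } => {
                for _ in 0..*count {
                    // Bails out as soon as the limit is exceeded.
                    self.compile(sub)?;
                }
            }
        }
        self.check_size()
    }
}

/// Compiles `expr` and returns the approximate size, or an error if the
/// size limit is exceeded along the way.
pub fn compile_with_limit(expr: &Expr, size_limit: usize) -> Result<usize, String> {
    let mut compiler = Compiler::new(size_limit);
    compiler.compile(expr)?;
    Ok(compiler.approx_size())
}
```

In this model, repeating `Expr::Empty` 294,967,295 times under a 10 MiB limit stops after a few hundred thousand iterations instead of running the whole loop.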

Many thanks to @VTCAKAVSMoACE for reporting this, providing the valuable
test cases and continuing to test this patch as it was developed.

Fixes https://github.com/rust-lang/regex/security/advisories/GHSA-m5pq-gvj9-9vr8
2022-03-03 10:05:00 -05:00
Alex Touchet
b92ffd5471
cargo: use SPDX license format
We were previously using '/' to indicate the dual licensing
scheme, but I guess we're now supposed to use 'OR'.

PR #843
2022-03-03 07:31:45 -05:00
Andrew Gallant
f6e52dafde
syntax: fix 'unused' warnings
It looks like the dead code detector got smarter. We never ended up
using the 'printer' field in these visitors, so just get rid of it.
2022-02-25 12:48:26 -05:00
Catena cyber
5197f21287
fuzz: do not use inherits in Cargo.toml
This fixes the oss-fuzz build.

Specifically, the build log[1] showed this error:

    Step #3 - "compile-libfuzzer-address-x86_64": error: inherits must
    not be specified in root profile dev

So we just remove it and inline the settings.

PR #817

[1] - https://oss-fuzz-build-logs.storage.googleapis.com/log-c9b61873-8950-4a50-a729-820d5617ff7a.txt
2021-11-17 16:49:44 -05:00
Dave Rolsky
3662851482
doc: fix typo
PR #814
2021-11-15 09:52:37 -05:00
Ian Kerins
63ee6699a2
syntax/doc: fix 'their' typo 2021-11-02 18:25:39 -04:00
Alex Touchet
d6bc7a4c3b
readme: remove broken badge
This was missed in bd0a142.

Fixes #797 (again)
2021-07-23 12:49:36 -04:00
Andrew Gallant
bd7466034f
fuzz: try to fix build issue
Ref: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=36474
See: https://oss-fuzz-build-logs.storage.googleapis.com/log-fe51f615-a13f-4685-b8d8-de4583da1ebd.txt
2021-07-23 08:39:44 -04:00
Andrew Gallant
bd0a14231b
readme: fix badges
Fixes #797, Fixes #798
2021-07-23 08:24:45 -04:00
Andrew Gallant
fce37e4932
dfa: remove some redundant branches
I discovered these while reviewing the code to prep for the rewrite
in regex-automata.
2021-06-26 09:16:29 -04:00
Andrew Gallant
6cdb9040f5
fuzz: bump libfuzzer-sys dependency
This is a half-hearted attempt to fix a build failure that I don't
understand in OSS-fuzz:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=34294

cc @DavidKorczynski
2021-05-14 08:31:50 -04:00
Andrew Gallant
f2dc1b788f
1.5.4 2021-05-06 17:34:20 -04:00
Andrew Gallant
aa170b54db
changelog: 1.5.4 2021-05-06 17:34:04 -04:00
Lucas
11543fd949
pattern: fix compilation
This fixes compilation when the 'pattern' feature
is enabled. This wasn't previously tested since it
is a nightly only feature. But in this commit, we
add it to CI explicitly.

PR #772
2021-05-06 17:32:55 -04:00
Dirk Stolle
238b7759ad
readme: update rustc version in MSRV badge
PR #773
2021-05-05 07:56:28 -04:00
Dirk Stolle
977aabd043
doc: fix some typos
PR #774
2021-05-05 07:56:08 -04:00
Andrew Gallant
26c8d8e461
1.5.3 2021-05-01 20:31:04 -04:00
Andrew Gallant
5f557188e0
deps: bump to regex-syntax 0.6.25 2021-05-01 20:31:02 -04:00
Andrew Gallant
3ea9e3eca7
regex-syntax-0.6.25 2021-05-01 20:30:34 -04:00
Andrew Gallant
908594905a
changelog: 1.5.3 2021-05-01 20:30:27 -04:00
Andrew Gallant
a8554b3cc4 syntax: fix compilation errors with unicode-perl
When only the unicode-perl feature is enabled, regex-syntax would fail
to build. It turns out that 'cargo fix' doesn't actually fix all
imports. It looks like it only fixes things that it can build in the
current configuration.

Fixes #769, Fixes #770
2021-05-01 18:52:18 -04:00
Andrew Gallant
0abcada3a7 ci: test scripts should fail on errors
While these test scripts are running in CI, if any of their commands
fail, they don't actually fail the build.
2021-05-01 18:52:18 -04:00
Andrew Gallant
2393c5555c
1.5.2 2021-05-01 07:44:06 -04:00
Andrew Gallant
eb009655e9
changelog: 1.5.2 2021-05-01 07:44:03 -04:00
Andrew Gallant
036ce80c93 compiler: fix lazy DFA false quits on ASCII text
One of the things the lazy DFA can't handle is Unicode word boundaries,
since it requires multi-byte look-around. However, it turns out that on
pure ASCII text, Unicode word boundaries are equivalent to ASCII word
boundaries. So the DFA has a heuristic: it treats Unicode word
boundaries as ASCII boundaries until it sees a non-ASCII byte. When it
does, it quits, and some other (slower) regex engine needs to take over.

In a bug report against ripgrep[1], it was discovered that the lazy DFA
was quitting and falling back to a slower engine even though the
haystack was pure ASCII.

It turned out that our equivalence byte class optimization was at fault.
Namely, a '{' (which appears very frequently in the input) was being
grouped in with other non-ASCII bytes. So whenever the DFA saw it, it
treated it as a non-ASCII byte and thus stopped.

The fix for this is simple: when we see a Unicode word boundary in the
compiler, we set a boundary on our byte classes such that ASCII bytes
are guaranteed to be in a different class from non-ASCII bytes. And
indeed, this fixes the performance problem reported in [1].
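
A minimal sketch of the byte class idea follows, assuming a boundary-set representation. The `ByteClassSet` type and its methods here are written for illustration and may differ from the crate's internals. It shows how adding a boundary after 0x7F forces ASCII and non-ASCII bytes into different classes.

```rust
/// Tracks boundaries between equivalence classes of bytes.
/// `self.0[b] == true` means a class ends at byte `b`.
/// (Illustrative type; details are assumptions, not the crate's code.)
pub struct ByteClassSet([bool; 256]);

impl ByteClassSet {
    pub fn new() -> Self {
        ByteClassSet([false; 256])
    }

    /// Record that bytes in `start..=end` must be distinguishable from
    /// every byte outside that range.
    pub fn set_range(&mut self, start: u8, end: u8) {
        debug_assert!(start <= end);
        if start > 0 {
            self.0[start as usize - 1] = true;
        }
        self.0[end as usize] = true;
    }

    /// Map every byte to its class identifier.
    pub fn byte_classes(&self) -> [u8; 256] {
        let mut classes = [0u8; 256];
        let mut class = 0u8;
        for b in 0..256usize {
            classes[b] = class;
            // A boundary at `b` starts a new class for `b + 1`.
            if self.0[b] && b < 255 {
                class += 1;
            }
        }
        classes
    }
}
```

With only a range like `a-z` recorded, `{` (0x7B) shares a class with every byte above it, including non-ASCII bytes. Calling `set_range(0x00, 0x7F)` adds the boundary that separates them.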

[1] - https://github.com/BurntSushi/ripgrep/issues/1860
2021-05-01 07:42:36 -04:00
Andrew Gallant
374c1680dc
1.5.1 2021-04-30 20:25:22 -04:00
Andrew Gallant
0c6dfbc1d9
impl: fix compilation error when perf-literal is disabled
It's unclear to me why CI did not catch this. CI explicitly tests
building regex without the perf-literal feature enabled.
2021-04-30 20:25:20 -04:00
Andrew Gallant
9f9f693768
1.5.0 2021-04-30 20:11:21 -04:00
Andrew Gallant
b0ff75df4e
impl: remove deprecated use of byte_classes
The auto_configure routine will now never disable it.
2021-04-30 20:10:35 -04:00
Andrew Gallant
f3b8479840
deps: bump regex-syntax minimum version to 0.6.24 2021-04-30 20:09:54 -04:00
Andrew Gallant
00fb09e0b7
regex-syntax-0.6.24 2021-04-30 20:09:30 -04:00
Andrew Gallant
99bd099a20
changelog: 1.5.0 2021-04-30 20:08:51 -04:00
Andrew Gallant
a2a393f1ff fmt: run 'cargo fmt --all'
It looks like 'cargo fix' didn't do this.
2021-04-30 20:02:56 -04:00
Andrew Gallant
832ba73877 msrv: bump to Rust 1.41.1
This was long overdue, and we were motivated by memchr's move to Rust
2018 in https://github.com/BurntSushi/memchr/pull/82.

Rust 1.41.1 was selected because it's the current version of Rust in
Debian Stable. It also feels old enough to assure wide support.
2021-04-30 20:02:56 -04:00
Andrew Gallant
e2860fe037 edition: manual fixups to code
This commit does a number of manual fixups to the code after the
previous two commits were done via 'cargo fix' automatically.

Actually, this contains more 'cargo fix' annotations, since I had
forgotten to add 'edition = "2018"' to all sub-crates.
2021-04-30 20:02:56 -04:00
Andrew Gallant
94ce242913 edition: more 2018 migration (idioms) 2021-04-30 20:02:56 -04:00
Andrew Gallant
cb108b77e7 edition: initial migration to Rust 2018 2021-04-30 20:02:56 -04:00
Andrew Gallant
ccdcf27805 imp: use new memmem impl from memchr crate
This removes the ad hoc FreqyPacked searcher and the implementation of
Boyer-Moore, and replaces it with a new implementation of memmem in the
memchr crate. (Introduced in memchr 2.4.) Since memchr 2.4 also moves to
Rust 2018, we'll do the same in subsequent commits. (Finally.)

The benchmarks look about as expected. Latency on some of the smaller
benchmarks has worsened slightly by a nanosecond or two. The top
throughput speed has also decreased, but some other benchmarks
(especially ones with frequent literal matches) have improved
dramatically.
2021-04-30 20:02:56 -04:00
Andrew Gallant
3db8722d0b
1.4.6 2021-04-22 17:59:28 -04:00
Andrew Gallant
41f14c2d9b fuzz: account for Unicode class size in compiler
This improves the precision of the "expression too big" regex
compilation error. Previously, it was not considering the heap usage
from Unicode character classes.

It's possible this will make some regexes fail to compile that
previously compiled. However, this is a bug fix. If you do wind up
seeing this though, feel free to file an issue, since it would be good
to get an idea of what kinds of regexes used to compile but no longer do.

This was found by OSS-fuzz:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=33579
2021-04-22 17:59:22 -04:00
Élie ROUDNINSKI
6d95a6f836
impl: shrink size of Inst
By using a boxed slice instead of a vector, we can shrink the size
of the `Inst` structure by 8 bytes going from 40 to 32 bytes on
64-bit platforms.
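
The effect can be illustrated with a small sketch. The two enums below are hypothetical stand-ins, not the crate's real `Inst` type, so their absolute sizes are not meant to match the 40 and 32 bytes quoted above. The point is only that a boxed slice drops the capacity word that a vector carries.

```rust
use std::mem::size_of;

/// Hypothetical instruction type that stores its ranges in a `Vec`
/// (pointer + length + capacity).
pub enum InstWithVec {
    Match(usize),
    Ranges { goto: usize, ranges: Vec<(char, char)> },
}

/// The same shape, but with a boxed slice (pointer + length only).
pub enum InstWithBoxedSlice {
    Match(usize),
    Ranges { goto: usize, ranges: Box<[(char, char)]> },
}

/// Returns `(size with Vec, size with boxed slice)` in bytes.
pub fn inst_sizes() -> (usize, usize) {
    (size_of::<InstWithVec>(), size_of::<InstWithBoxedSlice>())
}

/// Converting is cheap when the ranges are final: `into_boxed_slice`
/// drops any excess capacity and keeps the elements.
pub fn freeze(ranges: Vec<(char, char)>) -> Box<[(char, char)]> {
    ranges.into_boxed_slice()
}
```

The trade-off is that a boxed slice cannot grow in place, which is fine for data that is fixed once compilation finishes.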

PR #760
2021-04-14 07:52:15 -04:00
DavidKorczynski
cc0f2c9064
fuzz: update libfuzzer dependency
This is intended to fix an OSS-fuzz build failure detailed here:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=32817

Fixes #757
2021-04-08 10:43:47 -04:00
Andrew Gallant
ff283badce
1.4.5 2021-03-14 14:38:55 -04:00
Andrew Gallant
78c7cefbc9 impl: substantially reduce regex stack size
This commit fixes a fairly large regression in the stack size of a Regex
introduced in regex 1.4.4. When I dropped thread_local and replaced it
with Pool, it turned out that Pool inlined a T into its struct and a
Regex in turn had Pool inlined into itself. It further turns out that
the T=ProgramCache is itself quite large.

We fix this by introducing an indirection in the inner regex type. That
is, we use a Box<Pool> instead of a Pool. This shrinks the size of a
Regex from 856 bytes to 16 bytes.

Interestingly, prior to regex 1.4.4, a Regex was still quite substantial
in size, coming in at around 552 bytes. So it looks like the 1.4.4
release didn't dramatically increase it, but it increased it enough that
folks started experiencing real problems: stack overflows.

Since indirection can lead to worse locality and performance loss, I did
run the benchmark suite. I couldn't see any measurable difference. This
is generally what I would expect. This is an indirection at a fairly
high level. There's lots of other indirection already, and this
indirection isn't accessed in a hot path. (The regex cache itself is of
course used in hot paths, but by the time we get there, we have already
followed this particular pointer.)

We also include a regression test that asserts a Regex (and company) are
16 bytes in size. While this isn't an API guarantee, it at least means
that increasing the size of Regex will be an intentional thing in the
future and not an accidental leakage of implementation details.
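
As a rough sketch of the indirection and the size regression test, assuming simplified stand-in types: `ProgramCache`, `Pool`, `ReadOnly`, `InlineRegex` and `BoxedRegex` below are hypothetical and far smaller than the real ones.

```rust
use std::mem::size_of;
use std::sync::{Arc, Mutex};

/// Stand-in for a large scratch cache (512 bytes here, by assumption).
pub struct ProgramCache {
    _scratch: [u64; 64],
}

/// A pool that stores one `T` inline, which is what makes it large.
pub struct Pool<T> {
    _owner_value: T,
    _stack: Mutex<Vec<Box<T>>>,
}

/// Stand-in for the immutable, shared part of a compiled regex.
pub struct ReadOnly {
    _pattern: String,
}

/// Before: the pool (and so the cache) is inlined into the regex value.
pub struct InlineRegex {
    _ro: Arc<ReadOnly>,
    _pool: Pool<ProgramCache>,
}

/// After: the pool lives on the heap, so the regex is two pointers.
pub struct BoxedRegex {
    _ro: Arc<ReadOnly>,
    _pool: Box<Pool<ProgramCache>>,
}

/// Returns `(inline size, boxed size)` in bytes.
pub fn regex_sizes() -> (usize, usize) {
    (size_of::<InlineRegex>(), size_of::<BoxedRegex>())
}
```

A regression test in this style would assert that the boxed size equals two pointer widths, which is 16 bytes on 64-bit targets.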

Fixes #750, Fixes #751

Ref https://github.com/servo/servo/pull/28269
2021-03-14 14:38:56 -04:00
Andrew Gallant
951b8b93bb
1.4.4 2021-03-11 21:16:13 -05:00
Andrew Gallant
5a3570163b
regex-syntax-0.6.23 2021-03-11 21:15:50 -05:00
Andrew Gallant
967a0905a3
changelog: 1.4.4 2021-03-11 21:15:33 -05:00
Andrew Gallant
e040c1b063 impl: drop thread_local dependency
This commit removes the thread_local dependency (even as an optional
dependency) and replaces it with a more purpose-driven memory pool. The
comments in src/pool.rs explain this in more detail, but the short story
is that thread_local seems to be at the root of some memory leaks
happening in certain usage scenarios.

The great thing about thread_local though is how fast it is. Using a
simple Mutex<Vec<T>> is easily at least twice as slow. We work around
that a bit by coding a simplistic fast path for the "owner" of a pool.
This does require one new use of `unsafe`, which we document
extensively.
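
For reference, the simple `Mutex<Vec<T>>` approach mentioned above might look like the sketch below. It is an illustrative baseline only: the names `Pool`, `PoolGuard` and `idle` are assumptions, and the "owner" fast path (the part that needs `unsafe`) is deliberately left out.

```rust
use std::ops::{Deref, DerefMut};
use std::sync::Mutex;

/// A minimal pool: a mutex-protected stack of reusable values plus a
/// constructor for when the stack is empty.
pub struct Pool<T> {
    stack: Mutex<Vec<T>>,
    create: Box<dyn Fn() -> T + Send + Sync>,
}

/// Hands out a value and returns it to the pool when dropped.
pub struct PoolGuard<'a, T> {
    pool: &'a Pool<T>,
    value: Option<T>,
}

impl<T> Pool<T> {
    pub fn new(create: Box<dyn Fn() -> T + Send + Sync>) -> Self {
        Pool { stack: Mutex::new(Vec::new()), create }
    }

    /// Take a value from the pool, creating one if none is idle.
    pub fn get(&self) -> PoolGuard<'_, T> {
        // The lock is released at the end of this statement, before
        // `create` runs.
        let popped = self.stack.lock().unwrap().pop();
        let value = popped.unwrap_or_else(|| (self.create)());
        PoolGuard { pool: self, value: Some(value) }
    }

    /// Number of values currently sitting idle in the pool.
    pub fn idle(&self) -> usize {
        self.stack.lock().unwrap().len()
    }
}

impl<'a, T> Deref for PoolGuard<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.value.as_ref().unwrap()
    }
}

impl<'a, T> DerefMut for PoolGuard<'a, T> {
    fn deref_mut(&mut self) -> &mut T {
        self.value.as_mut().unwrap()
    }
}

impl<'a, T> Drop for PoolGuard<'a, T> {
    fn drop(&mut self) {
        if let Some(value) = self.value.take() {
            self.pool.stack.lock().unwrap().push(value);
        }
    }
}
```

Every `get` and every drop takes the lock, which is the cost that a fast path for the owning thread is meant to avoid.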

This now makes the 'perf-cache' feature a no-op. We of course retain it
for compatibility purposes (and perhaps it will be used again in the
future), but for now, we always use the same pool.

As for benchmarks, it is likely that *some* cases will get a hair
slower. But there shouldn't be any dramatic difference. A careful review
of micro-benchmarks in addition to more holistic (albeit ad hoc)
benchmarks via ripgrep seems to confirm this.

Now that we have more explicit control over the memory pool, we also
clean stuff up with respect to RefUnwindSafe.

Fixes #362, Fixes #576

Ref https://github.com/BurntSushi/rure-go/issues/3
2021-03-11 21:10:40 -05:00