1117 Commits

Author SHA1 Message Date
Andrew Gallant
aa170b54db
changelog: 1.5.4 2021-05-06 17:34:04 -04:00
Lucas
11543fd949
pattern: fix compilation
This fixes compilation when the 'pattern' feature
is enabled. This wasn't previously tested since it
is a nightly only feature. But in this commit, we
add it to CI explicitly.

PR #772
2021-05-06 17:32:55 -04:00
Dirk Stolle
238b7759ad
readme: update rustc version in MSRV badge
PR #773
2021-05-05 07:56:28 -04:00
Dirk Stolle
977aabd043
doc: fix some typos
PR #774
2021-05-05 07:56:08 -04:00
Andrew Gallant
26c8d8e461
1.5.3 2021-05-01 20:31:04 -04:00
Andrew Gallant
5f557188e0
deps: bump to regex-syntax 0.6.25 2021-05-01 20:31:02 -04:00
Andrew Gallant
3ea9e3eca7
regex-syntax-0.6.25 2021-05-01 20:30:34 -04:00
Andrew Gallant
908594905a
changelog: 1.5.3 2021-05-01 20:30:27 -04:00
Andrew Gallant
a8554b3cc4 syntax: fix compilation errors with unicode-perl
When only the unicode-perl feature is enabled, regex-syntax would fail
to build. It turns out that 'cargo fix' doesn't actually fix all
imports. It looks like it only fixes things that it can build in the
current configuration.

Fixes #769, Fixes #770
2021-05-01 18:52:18 -04:00
Andrew Gallant
0abcada3a7 ci: test scripts should fail on errors
While these test scripts are running in CI, if any of their commands
fail, they don't actually fail the build.
2021-05-01 18:52:18 -04:00
Andrew Gallant
2393c5555c
1.5.2 2021-05-01 07:44:06 -04:00
Andrew Gallant
eb009655e9
changelog: 1.5.2 2021-05-01 07:44:03 -04:00
Andrew Gallant
036ce80c93 compiler: fix lazy DFA false quits on ASCII text
One of the things the lazy DFA can't handle is Unicode word boundaries,
since it requires multi-byte look-around. However, it turns out that on
pure ASCII text, Unicode word boundaries are equivalent to ASCII word
boundaries. So the DFA has a heuristic: it treats Unicode word
boundaries as ASCII boundaries until it sees a non-ASCII byte. When it
does, it quits, and some other (slower) regex engine needs to take over.

In a bug report against ripgrep[1], it was discovered that the lazy DFA
was quitting and falling back to a slower engine even though the
haystack was pure ASCII.

It turned out that our equivalence byte class optimization was at fault.
Namely, a '{' (which appears very frequently in the input) was being
grouped in with other non-ASCII bytes. So whenever the DFA saw it, it
treated it as a non-ASCII byte and thus stopped.

The fix for this is simple: when we see a Unicode word boundary in the
compiler, we set a boundary on our byte classes such that ASCII bytes
are guaranteed to be in a different class from non-ASCII bytes. And
indeed, this fixes the performance problem reported in [1].

[1] - https://github.com/BurntSushi/ripgrep/issues/1860
2021-05-01 07:42:36 -04:00
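The byte class fix described in the commit above can be illustrated with a small sketch. This is not the crate's actual code: `ByteClassSet`, `set_range`, and `byte_classes` here are simplified, hypothetical stand-ins, and the only assumption is that classes are formed by marking the last byte of each range as a boundary. It shows how '{' can share a class with non-ASCII bytes until a boundary is forced at 0x7F.

```rust
/// Hypothetical, simplified byte class set: `true` at index `b` means
/// "an equivalence class ends at byte `b`".
struct ByteClassSet([bool; 256]);

impl ByteClassSet {
    fn new() -> Self {
        ByteClassSet([false; 256])
    }

    /// Marks `start..=end` as a range whose bytes must be distinguishable
    /// from the bytes on either side of it.
    fn set_range(&mut self, start: u8, end: u8) {
        if start > 0 {
            self.0[start as usize - 1] = true;
        }
        self.0[end as usize] = true;
    }

    /// Maps every byte to the identifier of its equivalence class.
    fn byte_classes(&self) -> [u8; 256] {
        let mut classes = [0u8; 256];
        let mut class = 0u8;
        for b in 0..256 {
            classes[b] = class;
            // Start a new class after each boundary (never past byte 255).
            if self.0[b] && b < 255 {
                class += 1;
            }
        }
        classes
    }
}

/// Returns true if '{' and the first non-ASCII byte share a class.
fn brace_looks_non_ascii(set: &ByteClassSet) -> bool {
    let classes = set.byte_classes();
    classes[b'{' as usize] == classes[0x80]
}
```

With only a range such as `a-z` registered, every byte from '{' (0x7B) through 0xFF lands in one class, so the DFA cannot tell '{' apart from a non-ASCII byte. Registering the range 0x00 through 0x7F, as the fix does conceptually when it sees a Unicode word boundary, splits that class at the ASCII limit.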
Andrew Gallant
374c1680dc
1.5.1 2021-04-30 20:25:22 -04:00
Andrew Gallant
0c6dfbc1d9
impl: fix compilation error when perf-literal is disabled
It's unclear to me why CI did not catch this. CI explicitly tests
building regex without the perf-literal feature enabled.
2021-04-30 20:25:20 -04:00
Andrew Gallant
9f9f693768
1.5.0 2021-04-30 20:11:21 -04:00
Andrew Gallant
b0ff75df4e
impl: remove deprecated use of byte_classes
The auto_configure routine will now never disable it.
2021-04-30 20:10:35 -04:00
Andrew Gallant
f3b8479840
deps: bump regex-syntax minimum version to 0.6.24 2021-04-30 20:09:54 -04:00
Andrew Gallant
00fb09e0b7
regex-syntax-0.6.24 2021-04-30 20:09:30 -04:00
Andrew Gallant
99bd099a20
changelog: 1.5.0 2021-04-30 20:08:51 -04:00
Andrew Gallant
a2a393f1ff fmt: run 'cargo fmt --all'
It looks like 'cargo fix' didn't do this.
2021-04-30 20:02:56 -04:00
Andrew Gallant
832ba73877 msrv: bump to Rust 1.41.1
This was long overdue, and we were motivated by memchr's move to Rust
2018 in https://github.com/BurntSushi/memchr/pull/82.

Rust 1.41.1 was selected because it's the current version of Rust in
Debian Stable. It also feels old enough to assure wide support.
2021-04-30 20:02:56 -04:00
Andrew Gallant
e2860fe037 edition: manual fixups to code
This commit does a number of manual fixups to the code after the
previous two commits were done via 'cargo fix' automatically.

Actually, this contains more 'cargo fix' annotations, since I had
forgotten to add 'edition = "2018"' to all sub-crates.
2021-04-30 20:02:56 -04:00
Andrew Gallant
94ce242913 edition: more 2018 migration (idioms) 2021-04-30 20:02:56 -04:00
Andrew Gallant
cb108b77e7 edition: initial migration to Rust 2018 2021-04-30 20:02:56 -04:00
Andrew Gallant
ccdcf27805 imp: use new memmem impl from memchr crate
This removes the ad hoc FreqyPacked searcher and the implementation of
Boyer-Moore, and replaces it with a new implementation of memmem in the
memchr crate. (Introduced in memchr 2.4.) Since memchr 2.4 also moves to
Rust 2018, we'll do the same in subsequent commits. (Finally.)

The benchmarks look about as expected. Latency on some of the smaller
benchmarks has worsened slightly by a nanosecond or two. The top
throughput speed has also decreased, and some other benchmarks
(especially ones with frequent literal matches) have improved
dramatically.
2021-04-30 20:02:56 -04:00
Andrew Gallant
3db8722d0b
1.4.6 2021-04-22 17:59:28 -04:00
Andrew Gallant
41f14c2d9b fuzz: account for Unicode class size in compiler
This improves the precision of the "expression too big" regex
compilation error. Previously, it was not considering the heap usage
from Unicode character classes.

It's possible this will make some regexes fail to compile that
previously compiled. However, this is a bug fix. If you do wind up
seeing this though, feel free to file an issue, since it would be good
to get an idea of what kinds of regexes no longer compile but did.

This was found by OSS-fuzz:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=33579
2021-04-22 17:59:22 -04:00
Élie ROUDNINSKI
6d95a6f836
impl: shrink size of Inst
By using a boxed slice instead of a vector, we can shrink the size
of the `Inst` structure by 8 bytes going from 40 to 32 bytes on
64-bit platforms.

PR #760
2021-04-14 07:52:15 -04:00
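The saving from a boxed slice comes from dropping the capacity field that a vector carries. The sketch below uses hypothetical structs (`RangesVec`, `RangesBoxed`) rather than the real `Inst` type, and it relies on the layout `Vec` has in practice (pointer, capacity, length), which the language does not formally guarantee.

```rust
use std::mem::size_of;

// Hypothetical instruction payload holding its ranges in a growable vector.
struct RangesVec {
    goto: usize,
    ranges: Vec<(char, char)>,
}

// The same payload with a boxed slice: pointer and length, no capacity.
struct RangesBoxed {
    goto: usize,
    ranges: Box<[(char, char)]>,
}

/// Converts a finished vector into a boxed slice once it no longer grows.
fn freeze(ranges: Vec<(char, char)>) -> Box<[(char, char)]> {
    ranges.into_boxed_slice()
}

/// Bytes saved per value by storing a boxed slice instead of a vector.
fn bytes_saved() -> usize {
    size_of::<RangesVec>() - size_of::<RangesBoxed>()
}
```

The trade-off is that a boxed slice cannot grow in place, which suits data that is built once during compilation and only read afterwards.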
DavidKorczynski
cc0f2c9064
fuzz: update libfuzzer dependency
This is intended to fix an OSS-fuzz build failure detailed here:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=32817

Fixes #757
2021-04-08 10:43:47 -04:00
Andrew Gallant
ff283badce
1.4.5 2021-03-14 14:38:55 -04:00
Andrew Gallant
78c7cefbc9 impl: substantially reduce regex stack size
This commit fixes a fairly large regression in the stack size of a Regex
introduced in regex 1.4.4. When I dropped thread_local and replaced it
with Pool, it turned out that Pool inlined a T into its struct and a
Regex in turn had Pool inlined into itself. It further turns out that
the T=ProgramCache is itself quite large.

We fix this by introducing an indirection in the inner regex type. That
is, we use a Box<Pool> instead of a Pool. This shrinks the size of a
Regex from 856 bytes to 16 bytes.

Interestingly, prior to regex 1.4.4, a Regex was still quite substantial
in size, coming in at around 552 bytes. So it looks like the 1.4.4
release didn't dramatically increase it, but it increased it enough that
folks started experiencing real problems: stack overflows.

Since indirection can lead to worse locality and performance loss, I did
run the benchmark suite. I couldn't see any measurable difference. This
is generally what I would expect. This is an indirection at a fairly
high level. There's lots of other indirection already, and this
indirection isn't accessed in a hot path. (The regex cache itself is of
course used in hot paths, but by the time we get there, we have already
followed this particular pointer.)

We also include a regression test that asserts a Regex (and company) are
16 bytes in size. While this isn't an API guarantee, it at least means
that increasing the size of Regex will be an intentional thing in the
future and not an accidental leakage of implementation details.

Fixes #750, Fixes #751

Ref https://github.com/servo/servo/pull/28269
2021-03-14 14:38:56 -04:00
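The effect of the indirection can be reproduced with stand-in types. Everything below is hypothetical (`ProgramCache`, `Pool`, `ReadOnly`, and the two regex structs are illustrative, not the crate's definitions); the sketch only assumes that the pool stores one value inline and that the regex holds one shared pointer plus the pool.

```rust
use std::mem::size_of;
use std::sync::Arc;

// Hypothetical stand-in for the large scratch space used during a search.
struct ProgramCache {
    scratch: [usize; 64],
}

// Hypothetical pool that keeps one value inline, plus a stack of spares.
struct Pool<T> {
    owner_val: T,
    stack: Vec<Box<T>>,
}

// Stand-in for the shared, immutable compiled program.
struct ReadOnly;

// The pool is embedded directly, so its size leaks into the regex type.
struct InlineRegex {
    ro: Arc<ReadOnly>,
    pool: Pool<ProgramCache>,
}

// The pool sits behind a pointer, so the regex is just two pointers wide.
struct BoxedRegex {
    ro: Arc<ReadOnly>,
    pool: Box<Pool<ProgramCache>>,
}

/// Returns the sizes of the inline and boxed variants, in that order.
fn regex_sizes() -> (usize, usize) {
    (size_of::<InlineRegex>(), size_of::<BoxedRegex>())
}
```

A size assertion of this kind, checked in a test, is one way to make any future growth of the type a deliberate decision, in the spirit of the regression test the commit mentions.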
Andrew Gallant
951b8b93bb
1.4.4 2021-03-11 21:16:13 -05:00
Andrew Gallant
5a3570163b
regex-syntax-0.6.23 2021-03-11 21:15:50 -05:00
Andrew Gallant
967a0905a3
changelog: 1.4.4 2021-03-11 21:15:33 -05:00
Andrew Gallant
e040c1b063 impl: drop thread_local dependency
This commit removes the thread_local dependency (even as an optional
dependency) and replaces it with a more purpose driven memory pool. The
comments in src/pool.rs explain this in more detail, but the short story
is that thread_local seems to be at the root of some memory leaks
happening in certain usage scenarios.

The great thing about thread_local though is how fast it is. Using a
simple Mutex<Vec<T>> is easily at least twice as slow. We work around
that a bit by coding a simplistic fast path for the "owner" of a pool.
This does require one new use of `unsafe`, which we document
extensively.

This now makes the 'perf-cache' feature a no-op. We of course retain it
for compatibility purposes (and perhaps it will be used again in the
future), but for now, we always use the same pool.

As for benchmarks, it is likely that *some* cases will get a hair
slower. But there shouldn't be any dramatic difference. A careful review
of micro-benchmarks in addition to more holistic (albeit ad hoc)
benchmarks via ripgrep seems to confirm this.

Now that we have more explicit control over the memory pool, we also
clean stuff up with respect to RefUnwindSafe.

Fixes #362, Fixes #576

Ref https://github.com/BurntSushi/rure-go/issues/3
2021-03-11 21:10:40 -05:00
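A minimal sketch of the slow path the commit describes, a pool built on `Mutex<Vec<Box<T>>>`, might look like the following. It is an assumption-laden illustration, not the contents of src/pool.rs: the names `Pool`, `PoolGuard`, `get`, and `idle` are chosen here for the example, and the owner fast path (the part that needs `unsafe`) is deliberately left out.

```rust
use std::ops::{Deref, DerefMut};
use std::sync::Mutex;

/// Hypothetical pool: values are created lazily and recycled through a
/// mutex-protected stack.
pub struct Pool<T> {
    stack: Mutex<Vec<Box<T>>>,
    create: Box<dyn Fn() -> T + Send + Sync>,
}

/// Hands out a value and returns it to the pool when dropped.
pub struct PoolGuard<'a, T> {
    pool: &'a Pool<T>,
    value: Option<Box<T>>,
}

impl<T> Pool<T> {
    pub fn new(create: impl Fn() -> T + Send + Sync + 'static) -> Self {
        Pool { stack: Mutex::new(Vec::new()), create: Box::new(create) }
    }

    /// Reuses an idle value if one exists, otherwise creates a new one.
    pub fn get(&self) -> PoolGuard<'_, T> {
        // The lock is released at the end of this statement, so `create`
        // never runs while the mutex is held.
        let popped = self.stack.lock().unwrap().pop();
        let value = popped.unwrap_or_else(|| Box::new((self.create)()));
        PoolGuard { pool: self, value: Some(value) }
    }

    /// Number of values currently waiting in the pool.
    pub fn idle(&self) -> usize {
        self.stack.lock().unwrap().len()
    }
}

impl<T> Deref for PoolGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.value.as_deref().unwrap()
    }
}

impl<T> DerefMut for PoolGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        self.value.as_deref_mut().unwrap()
    }
}

impl<T> Drop for PoolGuard<'_, T> {
    fn drop(&mut self) {
        if let Some(value) = self.value.take() {
            self.pool.stack.lock().unwrap().push(value);
        }
    }
}
```

Every `get` and every drop takes the mutex, which is the cost the commit works around with its fast path for the pool's owner.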
Andrew Gallant
5107293238 doc: refine use of the word 'unsafe'
This removes extraneous commentary that uses the word 'unsafe'. This
makes it easier to grep for usages of meaningful 'unsafe' in the code.
2021-03-11 21:10:40 -05:00
Andrew Gallant
691ec58171 bench: reduce huge regex a bit
It looks like it blows the default regex size limit at the moment.
2021-03-11 21:10:40 -05:00
Andrew Gallant
f858ff321d deps: update quickcheck and rand
The quickcheck update seems to have sussed out a bug in our DFA logic
regarding the encoding of NFA state IDs. But the bug seems unlikely to
occur in real code, so we massage the test data for now until the lazy
DFA gets moved into regex-automata.
2021-03-11 21:10:40 -05:00
Markus
bf7f8f19c6
doc: use 'text' instead of 'ignore' for regexes
This makes rendering a bit nicer by disabling syntax
highlighting and removing the "untested" warning.

PR #741
2021-01-21 17:50:49 -05:00
Alex Touchet
259863dfb6
doc: use HTTPS in links
PR #726
2021-01-12 07:31:38 -05:00
tom
2bab987149
api: Replacer for more string types
And do the same for the bytes oriented APIs.

This results in some small quality of life improvements
when using the Replacer trait with a string type that
isn't &str.

PR #728
2021-01-12 07:30:51 -05:00
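The idea of implementing a replacement trait for several string types can be sketched without the regex crate. The `Replacer` trait below is a simplified, hypothetical stand-in (the real trait also receives the capture groups), and `replace_first` is an illustrative helper that uses plain substring search instead of a regex.

```rust
use std::borrow::Cow;

/// Simplified, hypothetical replacement trait: appends the replacement
/// text to `dst`.
trait Replacer {
    fn replace_append(&mut self, dst: &mut String);
}

impl<'a> Replacer for &'a str {
    fn replace_append(&mut self, dst: &mut String) {
        dst.push_str(*self);
    }
}

impl Replacer for String {
    fn replace_append(&mut self, dst: &mut String) {
        dst.push_str(self.as_str());
    }
}

impl<'a> Replacer for &'a String {
    fn replace_append(&mut self, dst: &mut String) {
        dst.push_str(self.as_str());
    }
}

impl<'a> Replacer for Cow<'a, str> {
    fn replace_append(&mut self, dst: &mut String) {
        dst.push_str(&**self);
    }
}

/// Replaces the first occurrence of `needle` using any `Replacer`.
fn replace_first<R: Replacer>(haystack: &str, needle: &str, mut rep: R) -> String {
    match haystack.find(needle) {
        None => haystack.to_string(),
        Some(i) => {
            let mut dst = String::with_capacity(haystack.len());
            dst.push_str(&haystack[..i]);
            rep.replace_append(&mut dst);
            dst.push_str(&haystack[i + needle.len()..]);
            dst
        }
    }
}
```

With impls like these, a caller holding a `String` or a `Cow<str>` can pass it directly rather than first converting it to `&str`.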
Andrew Gallant
373d5ca4c5
1.4.3 2021-01-08 11:11:18 -05:00
Andrew Gallant
9b8b4074f8
cargo: bump regex-syntax to 0.6.22 2021-01-08 11:11:05 -05:00
Andrew Gallant
d27882cbd8
regex-syntax-0.6.22 2021-01-08 11:10:24 -05:00
Andrew Gallant
c28bf5d4de
changelog: 1.4.3 2021-01-08 11:10:05 -05:00
Ryan Lopopolo
ee94996c5d
api: add missing Debug impls for public types
In general, all public types should have a `Debug` impl.
Some types didn't because it was just never needed, but
it's good form to do it.

PR #735
2020-12-29 17:28:34 -05:00
Ryan Lopopolo
8a81699cfd
api: add missing implementations of core iterator traits
We add these for "good sense" reasons, although it's not
clear how useful or beneficial they are in practice.

PR #734
2020-12-29 13:54:30 -05:00
Andrew Gallant
0b15654ac8
github: replace reference to ripgrep 2020-12-29 13:12:12 -05:00
Andrew Gallant
954e03b478
github: add issue templates
I am getting tired of asking people for reproductions. Hopefully these
issue templates will help with that.
2020-12-29 13:10:41 -05:00