The key insight here is that all we need to do to support leftmost
semantics is to omit ALL failure transitions that appear after a match
state in the trie. (And to omit any entries in the trie that cross a
previously existing match state for leftmost-first semantics, and keep
them for leftmost-longest.)
Previously, I had somehow convinced myself that the subset was more
difficult to identify and required comparing depths. But this is just
not the case. Moreover, once you set the match state to have a failure
transition to the dead state, it automatically propagates to all
subsequent states.
This is such a huge simplification that I combined the 'standard' and
'leftmost' failure transition construction into a single method.
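To illustrate, here is a minimal sketch of what that combined
construction looks like, assuming a toy trie representation. Everything
here---`State`, the `DEAD`/`START` IDs and the function names---is
invented for illustration (match-copying is also elided); it is not the
crate's actual internals:

    use std::collections::VecDeque;

    // Hypothetical state IDs; the real representation differs.
    const DEAD: usize = 0;
    const START: usize = 1;

    struct State {
        trans: Vec<(u8, usize)>, // (byte, next state)
        fail: usize,
        is_match: bool,
    }

    fn next_state(states: &[State], id: usize, byte: u8) -> Option<usize> {
        states[id].trans.iter().find(|&&(b, _)| b == byte).map(|&(_, s)| s)
    }

    // Breadth-first failure transition construction, shared by the
    // 'standard' and 'leftmost' match semantics. (Copying matches from
    // failure targets is elided for brevity.)
    fn fill_failure_transitions(states: &mut [State], leftmost: bool) {
        let mut queue = VecDeque::new();
        for i in 0..states[START].trans.len() {
            let child = states[START].trans[i].1;
            states[child].fail = START;
            queue.push_back(child);
        }
        while let Some(id) = queue.pop_front() {
            // The whole trick: in leftmost mode, a match state fails to
            // the dead state. Since children derive their failure target
            // from their parent's, the dead state propagates to every
            // state past a match with no depth comparisons needed.
            if leftmost && states[id].is_match {
                states[id].fail = DEAD;
            }
            for i in 0..states[id].trans.len() {
                let (byte, next) = states[id].trans[i];
                queue.push_back(next);
                let mut fail = states[id].fail;
                let target = loop {
                    if fail == DEAD {
                        break DEAD;
                    }
                    if let Some(t) = next_state(states, fail, byte) {
                        break t;
                    }
                    if fail == START {
                        break START;
                    }
                    fail = states[fail].fail;
                };
                states[next].fail = target;
            }
        }
    }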
Fixes #92
The unused 'start' field in NonMatch is likely a remnant of some
experiments I was doing to get streaming search working with
leftmost match semantics.
The fact that 'config' is unused in the packed searcher was at
first surprising, but it's only ever used as part of construction.
These options aren't really carrying their weight. In a future release,
aho-corasick will enable both options by default, all the time, with no
way to disable them. The main reason for this is that
these options require a quadrupling in the amount of code in this crate.
While it's possible to see a performance hit when using byte classes, it
should generally be very small. The improvement, if one exists, just
doesn't seem worth it.
Please see https://github.com/BurntSushi/aho-corasick/issues/57 for more
discussion.
This is meant to mirror a similar decision occurring in regex-automata:
https://github.com/BurntSushi/regex-automata/issues/7.
When building the failure transitions, we iterate over the transitions
of each state. When ASCII case insensitivity is enabled, it's possible
for this transition list to contain duplicate states which in turn
results in creating duplicate matches in the NFA graph. It turns out
that this is strictly redundant work, so if we have already seen that
state, we can skip it.
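Roughly, the fix amounts to something like this (the transition
representation is a made-up stand-in, not the crate's actual
internals):

    use std::collections::HashSet;

    fn unique_next_states(trans: &[(u8, usize)]) -> Vec<usize> {
        let mut seen = HashSet::new();
        let mut out = Vec::new();
        for &(_, next) in trans {
            // With ASCII case insensitivity, e.g. b'S' and b's' both
            // lead to the same state; processing it twice would copy
            // its matches twice during failure construction.
            if seen.insert(next) {
                out.push(next);
            }
        }
        out
    }

    fn main() {
        // Both cases of 's' point at the same next state (2).
        let trans = vec![(b'S', 2), (b'a', 3), (b's', 2)];
        assert_eq!(unique_next_states(&trans), vec![2, 3]);
    }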
Fixes #68
This fixes yet another bug in the prefilter. Sigh. This only occurs when
doing a stream search. The problem is that the stream handling code
assumes that if no match is found at the end of the buffer, then the
current state of the automaton is correctly updated and the buffer can
be rolled.
With most prefilters that look for a candidate *start* of a match, this
is okay. If a prefilter can't find anything, then there's nothing to
start and the current state remains in the starting state.
But if the prefilter looks for a byte that may not be at the start of
the match---like the rare byte prefilter---then we cannot assume that a
match doesn't begin near the end of the buffer searched. And in this
case, the internal implementation of search doesn't correctly hold up
its contract because the current state won't be updated. That is, there
is an embedded assumption that if a prefilter fails then there is no
match and thus there is no need to update the current state ID. But of
course, this is just not true in a streaming context.
The right way to fix this is unfortunately to rethink how we've
implemented stream searching and make it aware of these kinds of
prefilters. I think, anyway. The other option would be to fix the lower
level search APIs to always make sure the current state ID is correct.
That would fix everything, but that seems tricky and probably requires
some delicate handling.
So for now, we just disable a prefilter entirely if it's a rare byte
prefilter and we're doing a stream search. We could build a backup
prefilter and still use that, but it feels like a gross hack. At least
now, we preserve correctness.
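Conceptually, the workaround is just this (a hedged sketch; the
`Prefilter` type and its variants are invented for illustration, not
the crate's API):

    enum Prefilter {
        StartByte(u8), // candidate is always the start of a match
        RareByte(u8),  // candidate may be anywhere inside a match
    }

    // Stream searching rolls its buffer assuming that "prefilter found
    // nothing" implies "the automaton's current state is up to date."
    // A rare byte prefilter violates that assumption, so drop it.
    fn prefilter_for_stream(pre: Option<Prefilter>) -> Option<Prefilter> {
        match pre {
            Some(Prefilter::RareByte(_)) => None,
            other => other,
        }
    }

    fn main() {
        assert!(prefilter_for_stream(Some(Prefilter::RareByte(b'/'))).is_none());
    }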
Kudos to @ogoffart who did the initial investigation here and came up
with a regression test, which is included in this commit. Note, though,
that some tests do fail when the buffer's size is set to its minimum. So
there was a regression at some point because we aren't getting the best
test coverage. We should just bite the bullet and make the buffer size
configurable as an internal API so that tests can tweak it and provoke
more edge cases.
Fixes #64
It relies on `cfg(doctest)`, which wasn't stabilized until Rust 1.43.
Interestingly, it compiled on Rust 1.28, but didn't compile on, e.g.,
Rust 1.39. This breaks our MSRV policy, so we unfortunately remove the
use of doc_comment for now. It's likely possible to conditionally
enable it, but the extra build script required to do version sniffing
doesn't seem worth it.
The same problem occurred with the regex crate:
d7fbd158f7
Fixes #62
This adds a few things to the feature list and updates the section on
prefilters to be in line with the current implementation. (The section
on prefilters had been written before aho-corasick adopted the Teddy
implementation.)
This fixes a bug where the replace_all_with routine wouldn't actually
stop when the closure returned false, even though the documentation
promised it would.
This commit includes test cases in the form of documentation examples.
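For example, with the fix in place, stopping early behaves as
documented. This sketch assumes the 0.7-era API, where
`AhoCorasick::new` returns the automaton directly:

    use aho_corasick::AhoCorasick;

    fn main() {
        let ac = AhoCorasick::new(&["x"]);
        let mut dst = String::new();
        ac.replace_all_with("x y x", &mut dst, |_, _, out| {
            out.push_str("z");
            false // returning false must stop replacement here
        });
        // Only the first match is replaced; the rest of the haystack
        // is copied over verbatim.
        assert_eq!(dst, "z y x");
    }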
Closes #59
This fixes another bug in the handling of case insensitivity inside
the rare byte prefilter. In particular, we were not correctly populating
the byte offset table when ASCII case insensitivity was enabled. Instead
of just setting the offsets for bytes we've seen, we also need to set
offsets for the ASCII case insensitive version of each byte we see. We
add that in this commit along with a regression test.
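The fix is conceptually simple. Here's a sketch with a toy offset table
(the `ByteOffsets` type is an assumption for illustration, not the
actual representation):

    struct ByteOffsets([u8; 256]);

    impl ByteOffsets {
        fn set(&mut self, byte: u8, offset: u8, case_insensitive: bool) {
            let mut record = |b: u8| {
                let slot = &mut self.0[b as usize];
                *slot = (*slot).max(offset);
            };
            record(byte);
            if case_insensitive {
                // Also record the offset for the other ASCII case,
                // since the searcher may report a candidate for
                // either case of the byte.
                record(byte.to_ascii_uppercase());
                record(byte.to_ascii_lowercase());
            }
        }
    }

    fn main() {
        let mut offsets = ByteOffsets([0; 256]);
        offsets.set(b'a', 5, true);
        assert_eq!(offsets.0[b'A' as usize], 5);
        assert_eq!(offsets.0[b'a' as usize], 5);
    }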
Fixes #55
This fixes a rather nasty bug where the rare byte prefilter computed
its shift offset incorrectly. In particular, when a rare byte
is found using a prefilter, we shift backwards in the haystack by the
maximum amount possible before confirming whether a match exists or not.
If this shift is not actually the maximum amount possible, then it's
quite possible that we will miss a match. (N.B. The prefilter
infrastructure takes care to avoid accidentally quadratic behavior.)
The specific regression in this case was caused by searching for these
two patterns:
ab/j/
x/
which would erroneously fail to match this haystack
ab/j/
When prefilters are enabled (the default), this particular search would
use the "rare two byte" prefilter. Specifically, it would detect '/' and
'j' as rare bytes, with '1' as the max offset for '/' and '3' as the max
offset for 'j'. The former is clearly incorrect, since '/' occurs at
offset 4 in the first pattern. This was being incorrectly computed
because we weren't actually looking at all possible bytes in all
patterns and recording their offsets. Once we found a rare byte, we
stopped trying to find more occurrences of it.
We fix this by now recording the maximum offsets of _all_ bytes for
_all_ patterns given. That way, we're guaranteed to have the correct
maximal shift amount for any rare byte found.
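Concretely, the corrected computation looks something like this (a
simplified sketch, not the actual implementation):

    fn max_offsets(patterns: &[&[u8]]) -> [u8; 256] {
        let mut offsets = [0u8; 256];
        for pat in patterns {
            // Record offsets for *all* bytes in *all* patterns, not
            // just until the first occurrence of a rare byte is found.
            for (i, &b) in pat.iter().enumerate().take(256) {
                offsets[b as usize] = offsets[b as usize].max(i as u8);
            }
        }
        offsets
    }

    fn main() {
        // From the regression: '/' occurs at offset 4 in "ab/j/", so
        // its maximum offset must be 4, not the 1 seen in "x/".
        let patterns: &[&[u8]] = &[b"ab/j/", b"x/"];
        let offsets = max_offsets(patterns);
        assert_eq!(offsets[b'/' as usize], 4);
        assert_eq!(offsets[b'j' as usize], 3);
    }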
Fixes #53