PR #86
24 KiB
This document describes the internal design of this crate, which is an object lesson in what happens when you take a fairly simple old algorithm like Aho-Corasick and make it fast and production ready.
The target audience of this document is Rust programmers that have some familiarity with string searching, however, one does not need to know the Aho-Corasick algorithm in order to read this (it is explained below). One should, however, know what a trie is. (If you don't, go read its Wikipedia article.)
The center-piece of this crate is an implementation of Aho-Corasick. On its own, Aho-Corasick isn't that complicated. The complex pieces come from the different variants of Aho-Corasick implemented in this crate. Specifically, they are:
- Aho-Corasick as an NFA, using dense transitions near the root with sparse transitions elsewhere.
- Aho-Corasick as a DFA. (An NFA is slower to search, but cheaper to construct
and uses less memory.)
- A DFA with pre-multiplied state identifiers. This saves a multiplication instruction in the core search loop.
- A DFA with equivalence classes of bytes as the alphabet, instead of the traditional 256-byte alphabet. This shrinks the size of the DFA in memory, but adds an extra lookup in the core search loop to map the input byte to an equivalent class.
- The option to choose how state identifiers are represented, via one of u8, u16, u32, u64 or usize. This permits creating compact automatons when matching a small number of patterns.
- Supporting "standard" match semantics, along with its overlapping variant, in addition to leftmost-first and leftmost-longest semantics. The "standard" semantics are typically what you see in a textbook description of Aho-Corasick. However, Aho-Corasick is also useful as an optimization in regex engines, which often use leftmost-first or leftmost-longest semantics. Thus, it is useful to implement those semantics here. The "standard" and "leftmost" search algorithms are subtly different, and also require slightly different construction algorithms.
- Support for ASCII case insensitive matching.
- Support for accelerating searches when the patterns all start with a small
number of fixed bytes. Or alternatively, when the patterns all contain a
small number of rare bytes. (Searching for these bytes uses SIMD vectorized
code courtesy of
memchr
.) - Transparent support for alternative SIMD vectorized search routines for smaller number of literals, such as the Teddy algorithm. We called these "packed" search routines because they use SIMD. They can often be an order of magnitude faster than just Aho-Corasick, but don't scale as well.
- Support for searching streams. This can reuse most of the underlying code, but does require careful buffering support.
- Support for anchored searches, which permit efficient
is_prefix
checks for a large number of patterns.
When you combine all of this together along with trying to make everything as
fast as possible, what you end up with is enitrely too much code with too much
unsafe
. Alas, I was not smart enough to figure out how to reduce it. Instead,
we will explain it.
Basics
The fundamental problem this crate is trying to solve is to determine the occurrences of possibly many patterns in a haystack. The naive way to solve this is to look for a match for each pattern at each position in the haystack:
for i in 0..haystack.len():
for p in patterns.iter():
if haystack[i..].starts_with(p.bytes()):
return Match(p.id(), i, i + p.bytes().len())
Those four lines are effectively all this crate does. The problem with those four lines is that they are very slow, especially when you're searching for a large number of patterns.
While there are many different algorithms available to solve this, a popular one is Aho-Corasick. It's a common solution because it's not too hard to implement, scales quite well even when searching for thousands of patterns and is generally pretty fast. Aho-Corasick does well here because, regardless of the number of patterns you're searching for, it always visits each byte in the haystack exactly once. This means, generally speaking, adding more patterns to an Aho-Corasick automaton does not make it slower. (Strictly speaking, however, this is not true, since a larger automaton will make less effective use of the CPU's cache.)
Aho-Corasick can be succinctly described as a trie with state transitions between some of the nodes that efficiently instruct the search algorithm to try matching alternative keys in the automaton. The trick is that these state transitions are arranged such that each byte of input needs to be inspected only once. These state transitions are typically called "failure transitions," because they instruct the searcher (the thing traversing the automaton while reading from the haystack) what to do when a byte in the haystack does not correspond to a valid transition in the current state of the trie.
More formally, a failure transition points to a state in the automaton that may lead to a match whose prefix is a proper suffix of the path traversed through the trie so far. (If no such proper suffix exists, then the failure transition points back to the start state of the trie, effectively restarting the search.) This is perhaps simpler to explain pictorally. For example, let's say we built an Aho-Corasick automaton with the following patterns: 'abcd' and 'cef'. The trie looks like this:
a - S1 - b - S2 - c - S3 - d - S4*
/
S0 - c - S5 - e - S6 - f - S7*
where states marked with a *
are match states (meaning, the search algorithm
should stop and report a match to the caller).
So given this trie, it should be somewhat straight-forward to see how it can
be used to determine whether any particular haystack starts with either
abcd
or cef
. It's easy to express this in code:
fn has_prefix(trie: &Trie, haystack: &[u8]) -> bool {
let mut state_id = trie.start();
// If the empty pattern is in trie, then state_id is a match state.
if trie.is_match(state_id) {
return true;
}
for (i, &b) in haystack.iter().enumerate() {
state_id = match trie.next_state(state_id, b) {
Some(id) => id,
// If there was no transition for this state and byte, then we know
// the haystack does not start with one of the patterns in our trie.
None => return false,
};
if trie.is_match(state_id) {
return true;
}
}
false
}
And that's pretty much it. All we do is move through the trie starting with the bytes at the beginning of the haystack. If we find ourselves in a position where we can't move, or if we've looked through the entire haystack without seeing a match state, then we know the haystack does not start with any of the patterns in the trie.
The meat of the Aho-Corasick algorithm is in how we add failure transitions to our trie to keep searching efficient. Specifically, it permits us to not only check whether a haystack starts with any one of a number of patterns, but rather, whether the haystack contains any of a number of patterns anywhere in the haystack.
As mentioned before, failure transitions connect a proper suffix of the path
traversed through the trie before, with a path that leads to a match that has a
prefix corresponding to that proper suffix. So in our case, for patterns abcd
and cef
, with a haystack abcef
, we want to transition to state S5
(from
the diagram above) from S3
upon seeing that the byte following c
is not
d
. Namely, the proper suffix in this example is c
, which is a prefix of
cef
. So the modified diagram looks like this:
a - S1 - b - S2 - c - S3 - d - S4*
/ /
/ ----------------
/ /
S0 - c - S5 - e - S6 - f - S7*
One thing that isn't shown in this diagram is that all states have a failure
transition, but only S3
has a non-trivial failure transition. That is, all
other states have a failure transition back to the start state. So if our
haystack was abzabcd
, then the searcher would transition back to S0
after
seeing z
, which effectively restarts the search. (Because there is no pattern
in our trie that has a prefix of bz
or z
.)
The code for traversing this automaton or finite state machine (it is no
longer just a trie) is not that much different from the has_prefix
code
above:
fn contains(fsm: &FiniteStateMachine, haystack: &[u8]) -> bool {
let mut state_id = fsm.start();
// If the empty pattern is in fsm, then state_id is a match state.
if fsm.is_match(state_id) {
return true;
}
for (i, &b) in haystack.iter().enumerate() {
// While the diagram above doesn't show this, we may wind up needing
// to follow multiple failure transitions before we land on a state
// in which we can advance. Therefore, when searching for the next
// state, we need to loop until we don't see a failure transition.
//
// This loop terminates because the start state has no empty
// transitions. Every transition from the start state either points to
// another state, or loops back to the start state.
loop {
match fsm.next_state(state_id, b) {
Some(id) => {
state_id = id;
break;
}
// Unlike our code above, if there was no transition for this
// state, then we don't quit. Instead, we look for this state's
// failure transition and follow that instead.
None => {
state_id = fsm.next_fail_state(state_id);
}
};
}
if fsm.is_match(state_id) {
return true;
}
}
false
}
Other than the complication around traversing failure transitions, this code is still roughly "traverse the automaton with bytes from the haystack, and quit when a match is seen."
And that concludes our section on the basics. While we didn't go deep into
how the automaton is built (see src/nfa.rs
, which has detailed comments about
that), the basic structure of Aho-Corasick should be reasonably clear.
NFAs and DFAs
There are generally two types of finite automata: non-deterministic finite automata (NFA) and deterministic finite automata (DFA). The difference between them is, principally, that an NFA can be in multiple states at once. This is typically accomplished by things called epsilon transitions, where one could move to a new state without consuming any bytes from the input. (The other mechanism by which NFAs can be in more than one state is where the same byte in a particular state transitions to multiple distinct states.) In contrast, a DFA can only ever be in one state at a time. A DFA has no epsilon transitions, and for any given state, a byte transitions to at most one other state.
By this formulation, the Aho-Corasick automaton described in the previous
section is an NFA. This is because failure transitions are, effectively,
epsilon transitions. That is, whenever the automaton is in state S
, it is
actually in the set of states that are reachable by recursively following
failure transitions from S
. (This means that, for example, the start state
is always active since the start state is reachable via failure transitions
from any state in the automaton.)
NFAs have a lot of nice properties. They tend to be easier to construct, and also tend to use less memory. However, their primary downside is that they are typically slower to execute. For example, the code above showing how to search with an Aho-Corasick automaton needs to potentially iterate through many failure transitions for every byte of input. While this is a fairly small amount of overhead, this can add up, especially if the automaton has a lot of overlapping patterns with a lot of failure transitions.
A DFA's search code, by contrast, looks like this:
fn contains(dfa: &DFA, haystack: &[u8]) -> bool {
let mut state_id = dfa.start();
// If the empty pattern is in dfa, then state_id is a match state.
if dfa.is_match(state_id) {
return true;
}
for (i, &b) in haystack.iter().enumerate() {
// An Aho-Corasick DFA *never* has a missing state that requires
// failure transitions to be followed. One byte of input advances the
// automaton by one state. Always.
state_id = dfa.next_state(state_id, b);
if dfa.is_match(state_id) {
return true;
}
}
false
}
The search logic here is much simpler than for the NFA, and this tends to translate into significant performance benefits as well, since there's a lot less work being done for each byte in the haystack. How is this accomplished? It's done by pre-following all failure transitions for all states for all bytes in the alphabet, and then building a single state transition table. Building this DFA can be much more costly than building the NFA, and use much more memory, but the better performance can be worth it.
Users of this crate can actually choose between using an NFA or a DFA. By
default, an NFA is used, because it typically strikes the best balance between
space usage and search performance. But the DFA option is available for cases
where a little extra memory and upfront time building the automaton is okay.
For example, the AhoCorasick::auto_configure
and
AhoCorasickBuilder::auto_configure
methods will enable the DFA setting if
there are a small number of patterns.
More DFA tricks
As described in the previous section, one of the downsides of using a DFA is that it uses more memory and can take longer to build. One small way of mitigating these concerns is to map the alphabet used by the automaton into a smaller space. Typically, the alphabet of a DFA has 256 elements in it: one element for each possible value that fits into a byte. However, in many cases, one does not need the full alphabet. For example, if all patterns in an Aho-Corasick automaton are ASCII letters, then this only uses up 52 distinct bytes. As far as the automaton is concerned, the rest of the 204 bytes are indistinguishable from one another: they will never disrciminate between a match or a non-match. Therefore, in cases like that, the alphabet can be shrunk to just 53 elements. One for each ASCII letter, and then another to serve as a placeholder for every other unused byte.
In practice, this library doesn't quite compute the optimal set of equivalence classes, but it's close enough in most cases. The key idea is that this then allows the transition table for the DFA to be potentially much smaller. The downside of doing this, however, is that since the transition table is defined in terms of this smaller alphabet space, every byte in the haystack must be re-mapped to this smaller space. This requires an additional 256-byte table. In practice, this can lead to a small search time hit, but it can be difficult to measure. Moreover, it can sometimes lead to faster search times for bigger automata, since it could be difference between more parts of the automaton staying in the CPU cache or not.
One other trick for DFAs employed by this crate is the notion of premultiplying state identifiers. Specifically, the normal way to compute the next transition in a DFA is via the following (assuming that the transition table is laid out sequentially in memory, in row-major order, where the rows are states):
next_state_id = dfa.transitions[current_state_id * 256 + current_byte]
However, since the value 256
is a fixed constant, we can actually premultiply
the state identifiers in the table when we build the table initially. Then, the
next transition computation simply becomes:
next_state_id = dfa.transitions[current_state_id + current_byte]
This doesn't seem like much, but when this is being executed for every byte of input that you're searching, saving that extra multiplication instruction can add up.
The same optimization works even when equivalence classes are enabled, as described above. The only difference is that the premultiplication is by the total number of equivalence classes instead of 256.
There isn't much downside to premultiplying state identifiers, other than the
fact that you may need to choose a bigger integer representation than you would
otherwise. For example, if you don't premultiply state identifiers, then an
automaton that uses u8
as a state identifier can hold up to 256 states.
However, if they are premultiplied, then it can only hold up to
floor(256 / len(alphabet))
states. Thus premultiplication impacts how compact
your DFA can be. In practice, it's pretty rare to use u8
as a state
identifier, so premultiplication is usually a good thing to do.
Both equivalence classes and premultiplication are tuneable parameters via the
AhoCorasickBuilder
type, and both are enabled by default.
Match semantics
One of the more interesting things about this implementation of Aho-Corasick that (as far as this author knows) separates it from other implementations, is that it natively supports leftmost-first and leftmost-longest match semantics. Briefly, match semantics refer to the decision procedure by which searching will disambiguate matches when there are multiple to choose from:
- standard match semantics emits matches as soon as they are detected by the automaton. This is typically equivalent to the textbook non-overlapping formulation of Aho-Corasick.
- leftmost-first match semantics means that 1) the next match is the match starting at the leftmost position and 2) among multiple matches starting at the same leftmost position, the match corresponding to the pattern provided first by the caller is reported.
- leftmost-longest is like leftmost-first, except when there are multiple matches starting at the same leftmost position, the pattern corresponding to the longest match is returned.
(The crate API documentation discusses these differences, with examples, in
more depth on the MatchKind
type.)
The reason why supporting these match semantics is important is because it gives the user more control over the match procedure. For example, leftmost-first permits users to implement match priority by simply putting the higher priority patterns first. Leftmost-longest, on the other hand, permits finding the longest possible match, which might be useful when trying to find words matching a dictionary. Additionally, regex engines often want to use Aho-Corasick as an optimization when searching for an alternation of literals. In order to preserve correct match semantics, regex engines typically can't use the standard textbook definition directly, since regex engines will implement either leftmost-first (Perl-like) or leftmost-longest (POSIX) match semantics.
Supporting leftmost semantics requires a couple key changes:
- Constructing the Aho-Corasick automaton changes a bit in both how the trie is
constructed and how failure transitions are found. Namely, only a subset of
the failure transitions are added. Specifically, only the failure transitions
that either do not occur after a match or do occur after a match but preserve
that match are kept. (More details on this can be found in
src/nfa.rs
.) - The search algorithm changes slightly. Since we are looking for the leftmost match, we cannot quit as soon as a match is detected. Instead, after a match is detected, we must keep searching until either the end of the input or until a dead state is seen. (Dead states are not used for standard match semantics. Dead states mean that searching should stop after a match has been found.)
Other implementations of Aho-Corasick do support leftmost match semantics, but they do it with more overhead at search time, or even worse, with a queue of matches and sophisticated hijinks to disambiguate the matches. While our construction algorithm becomes a bit more complicated, the correct match semantics fall out from the structure of the automaton itself.
Overlapping matches
One of the nice properties of an Aho-Corasick automaton is that it can report all possible matches, even when they overlap with one another. In this mode, the match semantics don't matter, since all possible matches are reported. Overlapping searches work just like regular searches, except the state identifier at which the previous search left off is carried over to the next search, so that it can pick up where it left off. If there are additional matches at that state, then they are reported before resuming the search.
Enabling leftmost-first or leftmost-longest match semantics causes the automaton to use a subset of all failure transitions, which means that overlapping searches cannot be used. Therefore, if leftmost match semantics are used, attempting to do an overlapping search will panic. Thus, to get overlapping searches, the caller must use the default standard match semantics. This behavior was chosen because there are only two alternatives, which were deemed worse:
- Compile two automatons internally, one for standard semantics and one for the semantics requested by the caller (if not standard).
- Create a new type, distinct from the
AhoCorasick
type, which has different capabilities based on the configuration options.
The first is untenable because of the amount of memory used by the automaton.
The second increases the complexity of the API too much by adding too many
types that do similar things. It is conceptually much simpler to keep all
searching isolated to a single type. Callers may query whether the automaton
supports overlapping searches via the AhoCorasick::supports_overlapping
method.
Stream searching
Since Aho-Corasick is an automaton, it is possible to do partial searches on partial parts of the haystack, and then resume that search on subsequent pieces of the haystack. This is useful when the haystack you're trying to search is not stored contiguously in memory, or if one does not want to read the entire haystack into memory at once.
Currently, only standard semantics are supported for stream searching. This is some of the more complicated code in this crate, and is something I would very much like to improve. In particular, it currently has the restriction that it must buffer at least enough of the haystack in memory in order to fit the longest possible match. The difficulty in getting stream searching right is that the implementation choices (such as the buffer size) often impact what the API looks like and what it's allowed to do.
Prefilters
In some cases, Aho-Corasick is not the fastest way to find matches containing multiple patterns. Sometimes, the search can be accelerated using highly optimized SIMD routines. For example, consider searching the following patterns:
Sherlock
Moriarty
Watson
It is plausible that it would be much faster to quickly look for occurrences of
the leading bytes, S
, M
or W
, before trying to start searching via the
automaton. Indeed, this is exactly what this crate will do.
When there are more than three distinct starting bytes, then this crate will look for three distinct bytes occurring at any position in the patterns, while preferring bytes that are heuristically determined to be rare over others. For example:
Abuzz
Sanchez
Vasquez
Topaz
Waltz
Here, we have more than 3 distinct starting bytes, but all of the patterns
contain z
, which is typically a rare byte. In this case, the prefilter will
scan for z
, back up a bit, and then execute the Aho-Corasick automaton.
If all of that fails, then a packed multiple substring algorithm will be
attempted. Currently, the only algorithm available for this is Teddy, but more
may be added in the future. Teddy is unlike the above prefilters in that it
confirms its own matches, so when Teddy is active, it might not be necessary
for Aho-Corasick to run at all. (See Automaton::leftmost_find_at_no_state_imp
in src/automaton.rs
.) However, the current Teddy implementation only works
in x86_64
and when SSSE3 or AVX2 are available, and moreover, only works
well when there are a small number of patterns (say, less than 100). Teddy
also requires the haystack to be of a certain length (more than 16-34 bytes).
When the haystack is shorter than that, Rabin-Karp is used instead. (See
src/packed/rabinkarp.rs
.)
There is a more thorough description of Teddy at
src/packed/teddy/README.md
.