third_party_rust_aho-corasick/DESIGN.md
2021-07-03 07:52:44 -04:00

24 KiB

This document describes the internal design of this crate, which is an object lesson in what happens when you take a fairly simple old algorithm like Aho-Corasick and make it fast and production ready.

The target audience of this document is Rust programmers that have some familiarity with string searching, however, one does not need to know the Aho-Corasick algorithm in order to read this (it is explained below). One should, however, know what a trie is. (If you don't, go read its Wikipedia article.)

The center-piece of this crate is an implementation of Aho-Corasick. On its own, Aho-Corasick isn't that complicated. The complex pieces come from the different variants of Aho-Corasick implemented in this crate. Specifically, they are:

  • Aho-Corasick as an NFA, using dense transitions near the root with sparse transitions elsewhere.
  • Aho-Corasick as a DFA. (An NFA is slower to search, but cheaper to construct and uses less memory.)
    • A DFA with pre-multiplied state identifiers. This saves a multiplication instruction in the core search loop.
    • A DFA with equivalence classes of bytes as the alphabet, instead of the traditional 256-byte alphabet. This shrinks the size of the DFA in memory, but adds an extra lookup in the core search loop to map the input byte to an equivalent class.
  • The option to choose how state identifiers are represented, via one of u8, u16, u32, u64 or usize. This permits creating compact automatons when matching a small number of patterns.
  • Supporting "standard" match semantics, along with its overlapping variant, in addition to leftmost-first and leftmost-longest semantics. The "standard" semantics are typically what you see in a textbook description of Aho-Corasick. However, Aho-Corasick is also useful as an optimization in regex engines, which often use leftmost-first or leftmost-longest semantics. Thus, it is useful to implement those semantics here. The "standard" and "leftmost" search algorithms are subtly different, and also require slightly different construction algorithms.
  • Support for ASCII case insensitive matching.
  • Support for accelerating searches when the patterns all start with a small number of fixed bytes. Or alternatively, when the patterns all contain a small number of rare bytes. (Searching for these bytes uses SIMD vectorized code courtesy of memchr.)
  • Transparent support for alternative SIMD vectorized search routines for smaller number of literals, such as the Teddy algorithm. We called these "packed" search routines because they use SIMD. They can often be an order of magnitude faster than just Aho-Corasick, but don't scale as well.
  • Support for searching streams. This can reuse most of the underlying code, but does require careful buffering support.
  • Support for anchored searches, which permit efficient is_prefix checks for a large number of patterns.

When you combine all of this together along with trying to make everything as fast as possible, what you end up with is enitrely too much code with too much unsafe. Alas, I was not smart enough to figure out how to reduce it. Instead, we will explain it.

Basics

The fundamental problem this crate is trying to solve is to determine the occurrences of possibly many patterns in a haystack. The naive way to solve this is to look for a match for each pattern at each position in the haystack:

for i in 0..haystack.len():
  for p in patterns.iter():
    if haystack[i..].starts_with(p.bytes()):
      return Match(p.id(), i, i + p.bytes().len())

Those four lines are effectively all this crate does. The problem with those four lines is that they are very slow, especially when you're searching for a large number of patterns.

While there are many different algorithms available to solve this, a popular one is Aho-Corasick. It's a common solution because it's not too hard to implement, scales quite well even when searching for thousands of patterns and is generally pretty fast. Aho-Corasick does well here because, regardless of the number of patterns you're searching for, it always visits each byte in the haystack exactly once. This means, generally speaking, adding more patterns to an Aho-Corasick automaton does not make it slower. (Strictly speaking, however, this is not true, since a larger automaton will make less effective use of the CPU's cache.)

Aho-Corasick can be succinctly described as a trie with state transitions between some of the nodes that efficiently instruct the search algorithm to try matching alternative keys in the automaton. The trick is that these state transitions are arranged such that each byte of input needs to be inspected only once. These state transitions are typically called "failure transitions," because they instruct the searcher (the thing traversing the automaton while reading from the haystack) what to do when a byte in the haystack does not correspond to a valid transition in the current state of the trie.

More formally, a failure transition points to a state in the automaton that may lead to a match whose prefix is a proper suffix of the path traversed through the trie so far. (If no such proper suffix exists, then the failure transition points back to the start state of the trie, effectively restarting the search.) This is perhaps simpler to explain pictorally. For example, let's say we built an Aho-Corasick automaton with the following patterns: 'abcd' and 'cef'. The trie looks like this:

     a - S1 - b - S2 - c - S3 - d - S4*
    /
S0 - c - S5 - e - S6 - f - S7*

where states marked with a * are match states (meaning, the search algorithm should stop and report a match to the caller).

So given this trie, it should be somewhat straight-forward to see how it can be used to determine whether any particular haystack starts with either abcd or cef. It's easy to express this in code:

fn has_prefix(trie: &Trie, haystack: &[u8]) -> bool {
  let mut state_id = trie.start();
  // If the empty pattern is in trie, then state_id is a match state.
  if trie.is_match(state_id) {
    return true;
  }
  for (i, &b) in haystack.iter().enumerate() {
    state_id = match trie.next_state(state_id, b) {
      Some(id) => id,
      // If there was no transition for this state and byte, then we know
      // the haystack does not start with one of the patterns in our trie.
      None => return false,
    };
    if trie.is_match(state_id) {
      return true;
    }
  }
  false
}

And that's pretty much it. All we do is move through the trie starting with the bytes at the beginning of the haystack. If we find ourselves in a position where we can't move, or if we've looked through the entire haystack without seeing a match state, then we know the haystack does not start with any of the patterns in the trie.

The meat of the Aho-Corasick algorithm is in how we add failure transitions to our trie to keep searching efficient. Specifically, it permits us to not only check whether a haystack starts with any one of a number of patterns, but rather, whether the haystack contains any of a number of patterns anywhere in the haystack.

As mentioned before, failure transitions connect a proper suffix of the path traversed through the trie before, with a path that leads to a match that has a prefix corresponding to that proper suffix. So in our case, for patterns abcd and cef, with a haystack abcef, we want to transition to state S5 (from the diagram above) from S3 upon seeing that the byte following c is not d. Namely, the proper suffix in this example is c, which is a prefix of cef. So the modified diagram looks like this:

     a - S1 - b - S2 - c - S3 - d - S4*
    /                      /
   /       ----------------
  /       /
S0 - c - S5 - e - S6 - f - S7*

One thing that isn't shown in this diagram is that all states have a failure transition, but only S3 has a non-trivial failure transition. That is, all other states have a failure transition back to the start state. So if our haystack was abzabcd, then the searcher would transition back to S0 after seeing z, which effectively restarts the search. (Because there is no pattern in our trie that has a prefix of bz or z.)

The code for traversing this automaton or finite state machine (it is no longer just a trie) is not that much different from the has_prefix code above:

fn contains(fsm: &FiniteStateMachine, haystack: &[u8]) -> bool {
  let mut state_id = fsm.start();
  // If the empty pattern is in fsm, then state_id is a match state.
  if fsm.is_match(state_id) {
    return true;
  }
  for (i, &b) in haystack.iter().enumerate() {
    // While the diagram above doesn't show this, we may wind up needing
    // to follow multiple failure transitions before we land on a state
    // in which we can advance. Therefore, when searching for the next
    // state, we need to loop until we don't see a failure transition.
    //
    // This loop terminates because the start state has no empty
    // transitions. Every transition from the start state either points to
    // another state, or loops back to the start state.
    loop {
      match fsm.next_state(state_id, b) {
        Some(id) => {
          state_id = id;
          break;
        }
        // Unlike our code above, if there was no transition for this
        // state, then we don't quit. Instead, we look for this state's
        // failure transition and follow that instead.
        None => {
          state_id = fsm.next_fail_state(state_id);
        }
      };
    }
    if fsm.is_match(state_id) {
      return true;
    }
  }
  false
}

Other than the complication around traversing failure transitions, this code is still roughly "traverse the automaton with bytes from the haystack, and quit when a match is seen."

And that concludes our section on the basics. While we didn't go deep into how the automaton is built (see src/nfa.rs, which has detailed comments about that), the basic structure of Aho-Corasick should be reasonably clear.

NFAs and DFAs

There are generally two types of finite automata: non-deterministic finite automata (NFA) and deterministic finite automata (DFA). The difference between them is, principally, that an NFA can be in multiple states at once. This is typically accomplished by things called epsilon transitions, where one could move to a new state without consuming any bytes from the input. (The other mechanism by which NFAs can be in more than one state is where the same byte in a particular state transitions to multiple distinct states.) In contrast, a DFA can only ever be in one state at a time. A DFA has no epsilon transitions, and for any given state, a byte transitions to at most one other state.

By this formulation, the Aho-Corasick automaton described in the previous section is an NFA. This is because failure transitions are, effectively, epsilon transitions. That is, whenever the automaton is in state S, it is actually in the set of states that are reachable by recursively following failure transitions from S. (This means that, for example, the start state is always active since the start state is reachable via failure transitions from any state in the automaton.)

NFAs have a lot of nice properties. They tend to be easier to construct, and also tend to use less memory. However, their primary downside is that they are typically slower to execute. For example, the code above showing how to search with an Aho-Corasick automaton needs to potentially iterate through many failure transitions for every byte of input. While this is a fairly small amount of overhead, this can add up, especially if the automaton has a lot of overlapping patterns with a lot of failure transitions.

A DFA's search code, by contrast, looks like this:

fn contains(dfa: &DFA, haystack: &[u8]) -> bool {
  let mut state_id = dfa.start();
  // If the empty pattern is in dfa, then state_id is a match state.
  if dfa.is_match(state_id) {
    return true;
  }
  for (i, &b) in haystack.iter().enumerate() {
    // An Aho-Corasick DFA *never* has a missing state that requires
    // failure transitions to be followed. One byte of input advances the
    // automaton by one state. Always.
    state_id = dfa.next_state(state_id, b);
    if dfa.is_match(state_id) {
      return true;
    }
  }
  false
}

The search logic here is much simpler than for the NFA, and this tends to translate into significant performance benefits as well, since there's a lot less work being done for each byte in the haystack. How is this accomplished? It's done by pre-following all failure transitions for all states for all bytes in the alphabet, and then building a single state transition table. Building this DFA can be much more costly than building the NFA, and use much more memory, but the better performance can be worth it.

Users of this crate can actually choose between using an NFA or a DFA. By default, an NFA is used, because it typically strikes the best balance between space usage and search performance. But the DFA option is available for cases where a little extra memory and upfront time building the automaton is okay. For example, the AhoCorasick::auto_configure and AhoCorasickBuilder::auto_configure methods will enable the DFA setting if there are a small number of patterns.

More DFA tricks

As described in the previous section, one of the downsides of using a DFA is that it uses more memory and can take longer to build. One small way of mitigating these concerns is to map the alphabet used by the automaton into a smaller space. Typically, the alphabet of a DFA has 256 elements in it: one element for each possible value that fits into a byte. However, in many cases, one does not need the full alphabet. For example, if all patterns in an Aho-Corasick automaton are ASCII letters, then this only uses up 52 distinct bytes. As far as the automaton is concerned, the rest of the 204 bytes are indistinguishable from one another: they will never disrciminate between a match or a non-match. Therefore, in cases like that, the alphabet can be shrunk to just 53 elements. One for each ASCII letter, and then another to serve as a placeholder for every other unused byte.

In practice, this library doesn't quite compute the optimal set of equivalence classes, but it's close enough in most cases. The key idea is that this then allows the transition table for the DFA to be potentially much smaller. The downside of doing this, however, is that since the transition table is defined in terms of this smaller alphabet space, every byte in the haystack must be re-mapped to this smaller space. This requires an additional 256-byte table. In practice, this can lead to a small search time hit, but it can be difficult to measure. Moreover, it can sometimes lead to faster search times for bigger automata, since it could be difference between more parts of the automaton staying in the CPU cache or not.

One other trick for DFAs employed by this crate is the notion of premultiplying state identifiers. Specifically, the normal way to compute the next transition in a DFA is via the following (assuming that the transition table is laid out sequentially in memory, in row-major order, where the rows are states):

next_state_id = dfa.transitions[current_state_id * 256 + current_byte]

However, since the value 256 is a fixed constant, we can actually premultiply the state identifiers in the table when we build the table initially. Then, the next transition computation simply becomes:

next_state_id = dfa.transitions[current_state_id + current_byte]

This doesn't seem like much, but when this is being executed for every byte of input that you're searching, saving that extra multiplication instruction can add up.

The same optimization works even when equivalence classes are enabled, as described above. The only difference is that the premultiplication is by the total number of equivalence classes instead of 256.

There isn't much downside to premultiplying state identifiers, other than the fact that you may need to choose a bigger integer representation than you would otherwise. For example, if you don't premultiply state identifiers, then an automaton that uses u8 as a state identifier can hold up to 256 states. However, if they are premultiplied, then it can only hold up to floor(256 / len(alphabet)) states. Thus premultiplication impacts how compact your DFA can be. In practice, it's pretty rare to use u8 as a state identifier, so premultiplication is usually a good thing to do.

Both equivalence classes and premultiplication are tuneable parameters via the AhoCorasickBuilder type, and both are enabled by default.

Match semantics

One of the more interesting things about this implementation of Aho-Corasick that (as far as this author knows) separates it from other implementations, is that it natively supports leftmost-first and leftmost-longest match semantics. Briefly, match semantics refer to the decision procedure by which searching will disambiguate matches when there are multiple to choose from:

  • standard match semantics emits matches as soon as they are detected by the automaton. This is typically equivalent to the textbook non-overlapping formulation of Aho-Corasick.
  • leftmost-first match semantics means that 1) the next match is the match starting at the leftmost position and 2) among multiple matches starting at the same leftmost position, the match corresponding to the pattern provided first by the caller is reported.
  • leftmost-longest is like leftmost-first, except when there are multiple matches starting at the same leftmost position, the pattern corresponding to the longest match is returned.

(The crate API documentation discusses these differences, with examples, in more depth on the MatchKind type.)

The reason why supporting these match semantics is important is because it gives the user more control over the match procedure. For example, leftmost-first permits users to implement match priority by simply putting the higher priority patterns first. Leftmost-longest, on the other hand, permits finding the longest possible match, which might be useful when trying to find words matching a dictionary. Additionally, regex engines often want to use Aho-Corasick as an optimization when searching for an alternation of literals. In order to preserve correct match semantics, regex engines typically can't use the standard textbook definition directly, since regex engines will implement either leftmost-first (Perl-like) or leftmost-longest (POSIX) match semantics.

Supporting leftmost semantics requires a couple key changes:

  • Constructing the Aho-Corasick automaton changes a bit in both how the trie is constructed and how failure transitions are found. Namely, only a subset of the failure transitions are added. Specifically, only the failure transitions that either do not occur after a match or do occur after a match but preserve that match are kept. (More details on this can be found in src/nfa.rs.)
  • The search algorithm changes slightly. Since we are looking for the leftmost match, we cannot quit as soon as a match is detected. Instead, after a match is detected, we must keep searching until either the end of the input or until a dead state is seen. (Dead states are not used for standard match semantics. Dead states mean that searching should stop after a match has been found.)

Other implementations of Aho-Corasick do support leftmost match semantics, but they do it with more overhead at search time, or even worse, with a queue of matches and sophisticated hijinks to disambiguate the matches. While our construction algorithm becomes a bit more complicated, the correct match semantics fall out from the structure of the automaton itself.

Overlapping matches

One of the nice properties of an Aho-Corasick automaton is that it can report all possible matches, even when they overlap with one another. In this mode, the match semantics don't matter, since all possible matches are reported. Overlapping searches work just like regular searches, except the state identifier at which the previous search left off is carried over to the next search, so that it can pick up where it left off. If there are additional matches at that state, then they are reported before resuming the search.

Enabling leftmost-first or leftmost-longest match semantics causes the automaton to use a subset of all failure transitions, which means that overlapping searches cannot be used. Therefore, if leftmost match semantics are used, attempting to do an overlapping search will panic. Thus, to get overlapping searches, the caller must use the default standard match semantics. This behavior was chosen because there are only two alternatives, which were deemed worse:

  • Compile two automatons internally, one for standard semantics and one for the semantics requested by the caller (if not standard).
  • Create a new type, distinct from the AhoCorasick type, which has different capabilities based on the configuration options.

The first is untenable because of the amount of memory used by the automaton. The second increases the complexity of the API too much by adding too many types that do similar things. It is conceptually much simpler to keep all searching isolated to a single type. Callers may query whether the automaton supports overlapping searches via the AhoCorasick::supports_overlapping method.

Stream searching

Since Aho-Corasick is an automaton, it is possible to do partial searches on partial parts of the haystack, and then resume that search on subsequent pieces of the haystack. This is useful when the haystack you're trying to search is not stored contiguously in memory, or if one does not want to read the entire haystack into memory at once.

Currently, only standard semantics are supported for stream searching. This is some of the more complicated code in this crate, and is something I would very much like to improve. In particular, it currently has the restriction that it must buffer at least enough of the haystack in memory in order to fit the longest possible match. The difficulty in getting stream searching right is that the implementation choices (such as the buffer size) often impact what the API looks like and what it's allowed to do.

Prefilters

In some cases, Aho-Corasick is not the fastest way to find matches containing multiple patterns. Sometimes, the search can be accelerated using highly optimized SIMD routines. For example, consider searching the following patterns:

Sherlock
Moriarty
Watson

It is plausible that it would be much faster to quickly look for occurrences of the leading bytes, S, M or W, before trying to start searching via the automaton. Indeed, this is exactly what this crate will do.

When there are more than three distinct starting bytes, then this crate will look for three distinct bytes occurring at any position in the patterns, while preferring bytes that are heuristically determined to be rare over others. For example:

Abuzz
Sanchez
Vasquez
Topaz
Waltz

Here, we have more than 3 distinct starting bytes, but all of the patterns contain z, which is typically a rare byte. In this case, the prefilter will scan for z, back up a bit, and then execute the Aho-Corasick automaton.

If all of that fails, then a packed multiple substring algorithm will be attempted. Currently, the only algorithm available for this is Teddy, but more may be added in the future. Teddy is unlike the above prefilters in that it confirms its own matches, so when Teddy is active, it might not be necessary for Aho-Corasick to run at all. (See Automaton::leftmost_find_at_no_state_imp in src/automaton.rs.) However, the current Teddy implementation only works in x86_64 and when SSSE3 or AVX2 are available, and moreover, only works well when there are a small number of patterns (say, less than 100). Teddy also requires the haystack to be of a certain length (more than 16-34 bytes). When the haystack is shorter than that, Rabin-Karp is used instead. (See src/packed/rabinkarp.rs.)

There is a more thorough description of Teddy at src/packed/teddy/README.md.