api: add memmem implementation, initially from bstr

This commit primarily adds vectorized substring search routines in a new
memmem sub-module. They were originally taken from bstr, but heavily
modified to incorporate a variant of the "generic SIMD" algorithm[1]. The
main highlights:

* We guarantee `O(m + n)` time complexity and constant space complexity.
* Two-Way is the primary implementation that can handle all cases.
* Vectorized variants handle a number of common cases.
* Vectorized code uses a heuristic informed by a frequency background
  distribution of bytes, originally devised inside the regex crate. This
  makes it more likely that searching will spend more time in the fast
  vector loops.

While adding memmem to this crate is perhaps a bit of a scope increase, I
think it fits well. It also puts a core primitive, substring search, very
low in the dependency DAG and therefore making it widely available. For
example, it is intended to use these new routines in the regex,
aho-corasick and bstr crates.

This commit does a number of other things, mainly as a result of
convenience. It drastically improves test coverage for substring search
(as compared to what bstr had), completely overhauls the benchmark suite
to make it more comprehensive and adds `cargo fuzz` support for all API
items in the crate.

Closes #58, Closes #72

[1] - http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd
2026-07-01 08:14:31 -04:00 · 2021-01-18 19:18:00 -05:00
parent 58c227886a
commit 448ec9e639
85 changed files with 193372 additions and 818 deletions
@@ -15,6 +15,8 @@ jobs:
      CARGO: cargo
      # When CARGO is set to CROSS, TARGET is set to `--target matrix.target`.
      TARGET:
+      # Make quickcheck run more tests for hopefully better coverage.
+      QUICKCHECK_TESTS: 100000
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
@@ -37,8 +37,12 @@ libc = { version = "0.2.18", default-features = false, optional = true }
 [dev-dependencies]
 quickcheck = { version = "1.0.3", default-features = false }

+[profile.release]
+debug = true
+
 [profile.bench]
 debug = true

 [profile.test]
 opt-level = 3
+debug = true
@@ -1,6 +1,6 @@
 memchr
 ======
-The `memchr` crate provides heavily optimized routines for searching bytes.
+This library provides heavily optimized routines for string search primitives.

 [![Build status](https://github.com/BurntSushi/rust-memchr/workflows/ci/badge.svg)](https://github.com/BurntSushi/rust-memchr/actions)
 [![](https://meritbadge.herokuapp.com/memchr)](https://crates.io/crates/memchr)
@@ -15,23 +15,15 @@ Dual-licensed under MIT or the [UNLICENSE](https://unlicense.org/).

 ### Overview

-The `memchr` function is traditionally provided by libc, but its
-performance can vary significantly depending on the specific
-implementation of libc that is used. They can range from manually tuned
-Assembly implementations (like that found in GNU's libc) all the way to
-non-vectorized C implementations (like that found in MUSL).
+* The top-level module provides routines for searching for 1, 2 or 3 bytes
+  in the forward or reverse direction. When searching for more than one byte,
+  positions are considered a match if the byte at that position matches any
+  of the bytes.
+* The `memmem` sub-module provides forward and reverse substring search
+  routines.

-To smooth out the differences between implementations of libc, at least
-on `x86_64` for Rust 1.27+, this crate provides its own implementation of
-`memchr` that should perform competitively with the one found in GNU's libc.
-The implementation is in pure Rust and has no dependency on a C compiler or an
-Assembler.
-
-Additionally, GNU libc also provides an extension, `memrchr`. This crate
-provides its own implementation of `memrchr` as well, on top of `memchr2`,
-`memchr3`, `memrchr2` and `memrchr3`. The difference between `memchr` and
-`memchr2` is that `memchr2` permits finding all occurrences of two bytes
-instead of one. Similarly for `memchr3`.
+In all such cases, routines operate on `&[u8]` without regard to encoding. This
+is exactly what you want when searching either UTF-8 or arbitrary bytes.

 ### Compiling without the standard library

@@ -43,10 +35,9 @@ memchr links to the standard library by default, but you can disable the
 memchr = { version = "2", default-features = false }
 ```

-On x86 platforms, when the `std` feature is disabled, the SSE2
-implementation of memchr will be used in compilers that support it. When
-`std` is enabled, the AVX implementation of memchr will be used if the CPU
-is determined to support it at runtime.
+On x86 platforms, when the `std` feature is disabled, the SSE2 accelerated
+implementations will be used. When `std` is enabled, AVX accelerated
+implementations will be used if the CPU is determined to support it at runtime.

 ### Using libc

@@ -58,11 +49,11 @@ using `memchr` from libc is desirable and a vectorized routine is not otherwise
 available in this crate, then enabling the `libc` feature will use libc's
 version of `memchr`.

-The rest of the functions in this crate, e.g., `memchr2` or `memrchr3`, are not
-a standard part of libc, so they will always use the implementations in this
-crate. One exception to this is `memrchr`, which is an extension commonly found
-on Linux. On Linux, `memrchr` is used in precisely the same scenario as
-`memchr`, as described above.
+The rest of the functions in this crate, e.g., `memchr2` or `memrchr3` and the
+substring search routines, will always use the implementations in this crate.
+One exception to this is `memrchr`, which is an extension in `libc` found on
+Linux. On Linux, `memrchr` is used in precisely the same scenario as `memchr`,
+as described above.


 ### Minimum Rust version policy
@@ -77,3 +68,20 @@ version of Rust.

 In general, this crate will be conservative with respect to the minimum
 supported version of Rust.
+
+
+### Testing strategy
+
+Given the complexity of the code in this crate, along with the pervasive use
+of `unsafe`, this crate has an extensive testing strategy. It combines multiple
+approaches:
+
+* Hand-written tests.
+* Exhaustive-style testing meant to exercise all possible branching and offset
+  calculations.
+* Property based testing through [`quickcheck`](https://github.com/BurntSushi/quickcheck).
+* Fuzz testing through [`cargo fuzz`](https://github.com/rust-fuzz/cargo-fuzz).
+* A huge suite of benchmarks that are also run as tests. Benchmarks always
+  confirm that the expected result occurs.
+
+Improvements to the testing infrastructure are very welcome.
@@ -18,6 +18,10 @@ harness = false
 path = "src/bench.rs"

 [dependencies]
+bstr = "0.2.15"
 criterion = "0.3.3"
 memchr = { version = "*", path = ".." }
 libc = "0.2.81"
+regex = "1.4.5"
+sliceslice = "0.2.1"
+twoway = "0.2.1"
@@ -0,0 +1,12 @@
+These were downloaded and derived from the Open Subtitles data set:
+https://opus.nlpl.eu/OpenSubtitles-v2018.php
+
+The specific way in which they were modified has been lost to time, but it's
+likely they were just a simple truncation based on target file sizes for
+various benchmarks.
+
+The main reason why we have them is that it gives us a way to test similar
+inputs on non-ASCII text. Normally this wouldn't matter for a substring search
+implementation, but because of the heuristics used to pick a priori determined
+"rare bytes" to base a prefilter on, it's possible for this heuristic to do
+more poorly on non-ASCII text than one might expect.
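Editorial illustration (not part of the commit): the "rare bytes" heuristic mentioned above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the crate's implementation. The names `rarest_offset` and `toy_rank` are hypothetical, and the ranking values are invented for the example; the real heuristic uses a background frequency table that is not shown in this section.

```rust
/// Hypothetical ranking: a lower value means the byte is assumed to be rarer.
/// The numbers here are made up purely for illustration.
fn toy_rank(b: u8) -> u8 {
    match b {
        b' ' | b'e' | b't' | b'a' => 255, // assumed very common
        b'z' | b'q' | b'x' => 10,         // assumed rare
        _ => 128,                         // everything else in between
    }
}

/// Returns the offset of the needle byte with the lowest rank, i.e. the byte
/// a prefilter would scan for first. Returns `None` for an empty needle.
/// On ties, the earliest offset wins.
fn rarest_offset(needle: &[u8], rank: impl Fn(u8) -> u8) -> Option<usize> {
    needle
        .iter()
        .enumerate()
        .min_by_key(|&(_, &b)| rank(b))
        .map(|(i, _)| i)
}
```

With this toy ranking, `rarest_offset(b"abczdef", toy_rank)` picks the `z` at offset 3. A ranking tuned on ASCII text can assign misleading ranks to the bytes that dominate UTF-8 encoded Russian or Chinese, which is the concern the paragraph above raises.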
@@ -0,0 +1,39 @@
+Now you can tell 'em.
+What for are you mixing in?
+Maybe I don't like to see kids get hurt.
+Break any bones, son?
+He's got a knife behind his collar!
+- There's a stirrup.
+You want a lift?
+- No.
+- Why not?
+- I'm beholden to you, mister.
+Couldn't we just leave it that way?
+- Morning.
+- Morning.
+- Put him up?
+- For how long?
+- I wouldn't know.
+- It'll be two bits for oats.
+- Ain't I seen you before?
+- Depends on where you've been.
+- I follow the railroad, mostly.
+- Could be you've seen me.
+- It'll be four bits if he stays the night.
+- Fair enough.
+Morning.
+Did a man ride in today - tall, sort of heavyset?
+- You mean him, Mr Renner?
+- Not him.
+This one had a scar.
+Along his cheek?
+No, sir.
+I don't see no man with a scar.
+I guess maybe I can have some apple pie and coffee.
+I guess you could have eggs with bacon if you wanted eggs with bacon.
+- Hello, Charlie.
+- Hello, Grant.
+It's good to see you, Charlie.
+It's awful good to see you.
+It's good to see you too.
+Doc you're beginning to sound like Sherlock Holmes.
@@ -0,0 +1 @@
+Sound like Sherlock Holmes.
@@ -0,0 +1,2 @@
+I saw you before but I didn't think you were this young
+Doc you're beginning to sound like Sherlock Holmes.
@@ -0,0 +1,18 @@
+-Две недели не даешь мне прохода.
+Вот и действуй, чем ты рискуешь?
+Я думал, что сделаю тебя счастливой.
+Тоже мне счастье.
+Муж не дает ни гроша, и у любовника ума не хватает подумать о деньгах.
+- Хорошенькое счастье.
+- Извини, я думал, ты любишь меня.
+Ну люблю, люблю тебя, но и не хочу, чтобы все началось как в прошлый раз.
+Ты не права.
+У меня для тебя сюрприз.
+Шлихтовальная машина, ты о ней давно мечтала.
+-Для костей?
+- Нет, настоящая.
+Хочешь, приходи за ней вечером.
+Я тебе не девочка.
+Была бы ты девочкой, я бы тебе ее не купил.
+Я люблю тебя
+Митч МакКафи, летающий Шерлок Холмс.
@@ -0,0 +1 @@
+летающий Шерлок Холмс.
@@ -0,0 +1,2 @@
+Это - одно из самых поразительных недавних открытий науки.
+Митч МакКафи, летающий Шерлок Холмс.
@@ -0,0 +1,28 @@
+魯哇克香貓咖啡 世界上最稀有的飲品 Kopi luwak.
+the rarest beverage in the world.
+嘗一小口 Take a whiff.
+來 Go ahead.
+寇爾先生 董事會已準備好聽你的提案 Uh, mr.
+cole, the board is ready to hear your proposal.
+等一下下 Hold on just a second.
+來 繼續 Go ahead.
+go on.
+怎樣 Well?
+真不錯 Really good.
+真不錯 Really good.
+寇爾先生?
+Mr.
+cole.
+sir?
+吉姆 你知道庸俗是什麼嗎 Do you know what a philistine is, jim?
+先生 我叫理查德 Sir, it's richard.
+沒錯 費爾 出動你的如簧巧舌吧 That's right, phil.
+give them the spiel.
+謝謝 主席先生 主管們 Thank you, mr.
+chairman, fellow supervisors.
+我們寇爾集團財務的管理不善 We at the cole group feel the decline of the winwood hospital...
+直接造成了溫伍德醫院的衰敗 ...is a direct result of significant fiscal mismanagement.
+請原諒 我們醫院...
+I beg your pardon, this hospital...
+日常開支近2倍 overhead costs are nearly double.
+帽子不错 汤姆 夏洛克·福尔摩斯
@@ -0,0 +1 @@
+汤姆 夏洛克·福尔摩斯
@@ -0,0 +1,3 @@
+谁是早餐界的冠军?
+你突然来信说最近要搬到这里
+帽子不错 汤姆 夏洛克·福尔摩斯
@@ -0,0 +1,54 @@
+These data sets are specifically crafted to try and defeat heuristic
+optimizations in various substring search implementations. The point of these
+is to make the costs of those heuristics clearer. In particular, the main idea
+behind heuristics is to sell out some rare or edge cases in favor of making
+some common cases *a lot* faster (potentially by orders of magnitude). The key
+to this is to make sure that those edge cases are impacted at tolerable levels.
+
+Below is a description of each.
+
+* `repeated-rare-*`: This is meant to be used with the needle `abczdef`. This
+  input defeats a heuristic in the old bstr and regex substring implementations
+  that looked for a rare byte (in this case, `z`) to run memchr on before
+  looking for an actual match. This particular input causes that heuristic to
+  stop on every byte in the input. In regex's case in particular, this causes
+  `O(mn)` time complexity. (In the case of `bstr`, it does a little better by
+  stopping this heuristic after a number of tries once it becomes clear that it
+  is ineffective.)
+* `defeat-simple-vector`: The corpus consists of `qaz` repeated over and over
+  again. The intended needle is `qbz`. This is meant to be difficult for the
+  "generic SIMD" algorithm[1] to handle. Namely, it will repeatedly find a
+  candidate match via the `q` and `z` bytes in the needle, but the overall
+  match will fail at the `memcmp` phase. Nevertheless, optimized versions of
+  [1] still do reasonably well on this benchmark because the `memcmp` can be
+  specialized to a single `u32` unaligned load and compare.
+* `defeat-simple-vector-freq`: This is similarish to `defeat-simple-vector`,
+  except it also attempts to defeat heuristic frequency analysis. The corpus
+  consists of `qjaz` repeated over and over again, with the intended needle
+  being `qja{49}z`. Heuristic frequency analysis might try either the `q` or
+  the `j`, in addition to `z`. Given the nature of the corpus, this will result
+  in a lot of false positive candidates, thus leading to an ineffective
+  prefilter.
+* `defeat-simple-vector-repeated`: This combines the "repeated-rare" and
+  "defeat-simple-vector" inputs. The corpus consists of `z` entirely, with only
+  the second to last byte being changed to `a`. The intended needle is
+  `z{135}az`. The key here is that in [1], a candidate match will be found at
+  every position in the haystack. And since the needle is very large, this will
+  result in a full `memcmp` call out. [1] effectively drowns in `memcmp` being
+  called at every position in the haystack. The algorithm in this crate does
+  a bit better by noticing that the prefilter is ineffective and falling back
+  to standard Two-Way.
+* `md5-huge`: This file contains one md5 hash per line for each word in the
+  `../sliceslice/words.txt` corpus. The intent of this benchmark is to defeat
+  frequency heuristics by using a corpus comprised of random data. That is,
+  no one byte should be significantly more frequent than any other.
+* `random-huge`: Similar to `md5-huge`, but with longer lines and more
+  principally random data. Generated via
+  `dd if=/dev/urandom bs=32 count=10000 | xxd -ps -c32`.
+  This was derived from a real world benchmark reported to ripgrep[2].
+  In particular, it originally motivated the addition of Boyer-Moore to
+  the regex crate, but now this case is handled just fine by the memmem
+  implementation in this crate.
+
+[1]: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd
+[2]: https://github.com/BurntSushi/ripgrep/issues/617
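Editorial illustration (not part of the commit): the way `defeat-simple-vector` stresses a first-and-last-byte candidate check can be sketched in scalar code. This is a simplified stand-in for the "generic SIMD" idea, assuming one position is examined at a time rather than a full vector; the function name `find_first_last` is hypothetical.

```rust
/// Scalar stand-in for the candidate-then-verify idea: a position is a
/// candidate when the first and last needle bytes line up, and only
/// candidates pay for a full comparison.
///
/// Returns the match offset (if any) and the number of candidates that had
/// to be verified before the search finished.
fn find_first_last(haystack: &[u8], needle: &[u8]) -> (Option<usize>, usize) {
    if needle.is_empty() {
        return (Some(0), 0);
    }
    if needle.len() > haystack.len() {
        return (None, 0);
    }
    let (first, last) = (needle[0], needle[needle.len() - 1]);
    let mut verified = 0;
    for i in 0..=haystack.len() - needle.len() {
        // Cheap filter: compare only the two anchor bytes.
        if haystack[i] == first && haystack[i + needle.len() - 1] == last {
            // Expensive step: the full comparison (the `memcmp` phase).
            verified += 1;
            if &haystack[i..i + needle.len()] == needle {
                return (Some(i), verified);
            }
        }
    }
    (None, verified)
}
```

On a corpus of `qaz` repeated 1000 times with the needle `qbz`, every third position passes the cheap filter and fails verification, so the sketch performs 1000 verifications and finds nothing. That is the false-positive behavior the description above is designed to provoke.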
@@ -0,0 +1 @@
+zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
@@ -0,0 +1,6 @@
+These were the original inputs used for the memchr benchmarks. In theory, they
+could be replaced with the subtitle text in order to trim down the number of
+inputs we use in the benchmark suite.
+
+The nice thing about the subtitle corpus is that it gives us a translation into
+Russian and Chinese, which lets us measure our code on non-ASCII text.
@@ -0,0 +1,24 @@
+These benchmark inputs were taken from the sliceslice [project benchmarks][1].
+
+These inputs drive two benchmarks, one on short haystacks and the other on long
+haystacks, with a slightly unusual but interesting configuration. Neither of
+these benchmarks includes the time it takes to build a searcher. They only
+measure actual search time.
+
+The short haystack benchmark starts by loading all of the words in `words.txt`
+into memory and sorting them in ascending order by their length. Then, a
+substring searcher is created for each of these words in the same order. The
+actual benchmark consists of executing each searcher once on every needle that
+appears after it in the list. In essence, this benchmark tests how quickly the
+implementation can deal with tiny haystacks. The results of this benchmark tend
+to come down to how much overhead the implementation has. In other words, this
+benchmark tests latency.
+
+The long haystack benchmark has a setup similar to the short haystack
+benchmark, except it also loads the contents of `i386.txt` into memory. The
+actual benchmark itself executes each of the searchers built (from `words.txt`)
+on the `i386.txt` haystack. This benchmark, executing on a much longer
+haystack, tests throughput as opposed to latency across a wide variety of
+needles.
+
+[1]: https://github.com/cloudflare/sliceslice-rs
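Editorial illustration (not part of the commit): the shape of the short haystack benchmark described above can be sketched with the standard library alone. This assumes `str::contains` as a stand-in for a prebuilt searcher, so unlike the real benchmark it does not separate searcher construction from search time; the function name `short_haystack_pass` is hypothetical.

```rust
/// Sorts `words` by ascending length, then searches for each word in every
/// word that comes after it in the sorted list. Returns the number of hits,
/// which a benchmark could assert on to confirm the expected result.
fn short_haystack_pass(words: &mut [&str]) -> usize {
    // Stable sort, so words of equal length keep their relative order.
    words.sort_by_key(|w| w.len());
    let mut hits = 0;
    for (i, needle) in words.iter().enumerate() {
        // Every later word is at least as long, so it can act as a haystack.
        for haystack in &words[i + 1..] {
            if haystack.contains(*needle) {
                hits += 1;
            }
        }
    }
    hits
}
```

Because every haystack is a single word, per-call overhead dominates the running time, which is why the description above characterizes this configuration as a latency test.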
@@ -1,30 +1,47 @@
 use criterion::{
-    criterion_group, criterion_main, Bencher, Benchmark, Criterion, Throughput,
+    criterion_group, criterion_main, Bencher, Criterion, Throughput,
 };

 mod data;
 mod memchr;
+mod memmem;

 fn all(c: &mut Criterion) {
    memchr::all(c);
+    memmem::all(c);
 }

+/// A convenience function for defining a Criterion benchmark using our own
+/// conventions and a common config.
+///
+/// Note that we accept `bench` as a boxed closure to avoid the costs
+/// of monomorphization. Particularly with the memchr benchmarks, this
+/// function getting monomorphized (which also monomorphizes via Criterion's
+/// `bench_function`) bloats compile times dramatically (by an order of
+/// magnitude). This is okay to do since `bench` isn't the actual thing we
+/// measure. The measurement comes from running `Bencher::iter` from within
+/// `bench`. So the dynamic dispatch is okay here.
 fn define(
    c: &mut Criterion,
-    group_name: &str,
-    bench_name: &str,
+    name: &str,
    corpus: &[u8],
-    bench: impl FnMut(&mut Bencher<'_>) + 'static,
+    bench: Box<dyn FnMut(&mut Bencher<'_>) + 'static>,
 ) {
-    let tput = Throughput::Bytes(corpus.len() as u64);
-    // let benchmark = Benchmark::new(bench_name, bench).throughput(tput);
-
-    let benchmark = Benchmark::new(bench_name, bench)
-        .throughput(tput)
-        .sample_size(30)
+    // I don't really "get" the whole Criterion benchmark group thing. I just
+    // want a flat namespace to define all benchmarks. The only thing that
+    // matters to me is that we can group benchmarks arbitrarily using the
+    // name only. So we play Criterion's game by splitting our benchmark name
+    // on the first slash.
+    //
+    // N.B. We don't include the slash, since Criterion automatically adds it.
+    let mut it = name.splitn(2, "/");
+    let (group_name, bench_name) = (it.next().unwrap(), it.next().unwrap());
+    c.benchmark_group(group_name)
+        .throughput(Throughput::Bytes(corpus.len() as u64))
+        .sample_size(10)
        .warm_up_time(std::time::Duration::from_millis(500))
-        .measurement_time(std::time::Duration::from_secs(2));
-    c.bench(group_name, benchmark);
+        .measurement_time(std::time::Duration::from_secs(2))
+        .bench_function(bench_name, bench);
 }

 criterion_group!(does_not_matter, all);
@@ -1,6 +1,64 @@
+#![allow(dead_code)]
+
 pub const SHERLOCK_HUGE: &'static [u8] =
-    include_bytes!("../data/sherlock-holmes-huge.txt");
+    include_bytes!("../data/sherlock/huge.txt");
 pub const SHERLOCK_SMALL: &'static [u8] =
-    include_bytes!("../data/sherlock-holmes-small.txt");
+    include_bytes!("../data/sherlock/small.txt");
 pub const SHERLOCK_TINY: &'static [u8] =
-    include_bytes!("../data/sherlock-holmes-tiny.txt");
+    include_bytes!("../data/sherlock/tiny.txt");
+
+pub const SUBTITLE_EN_HUGE: &'static str =
+    include_str!("../data/opensubtitles/en-huge.txt");
+pub const SUBTITLE_EN_MEDIUM: &'static str =
+    include_str!("../data/opensubtitles/en-medium.txt");
+pub const SUBTITLE_EN_SMALL: &'static str =
+    include_str!("../data/opensubtitles/en-small.txt");
+pub const SUBTITLE_EN_TINY: &'static str =
+    include_str!("../data/opensubtitles/en-tiny.txt");
+pub const SUBTITLE_EN_TEENY: &'static str =
+    include_str!("../data/opensubtitles/en-teeny.txt");
+
+pub const SUBTITLE_RU_HUGE: &'static str =
+    include_str!("../data/opensubtitles/ru-huge.txt");
+pub const SUBTITLE_RU_MEDIUM: &'static str =
+    include_str!("../data/opensubtitles/ru-medium.txt");
+pub const SUBTITLE_RU_SMALL: &'static str =
+    include_str!("../data/opensubtitles/ru-small.txt");
+pub const SUBTITLE_RU_TINY: &'static str =
+    include_str!("../data/opensubtitles/ru-tiny.txt");
+pub const SUBTITLE_RU_TEENY: &'static str =
+    include_str!("../data/opensubtitles/ru-teeny.txt");
+
+pub const SUBTITLE_ZH_HUGE: &'static str =
+    include_str!("../data/opensubtitles/zh-huge.txt");
+pub const SUBTITLE_ZH_MEDIUM: &'static str =
+    include_str!("../data/opensubtitles/zh-medium.txt");
+pub const SUBTITLE_ZH_SMALL: &'static str =
+    include_str!("../data/opensubtitles/zh-small.txt");
+pub const SUBTITLE_ZH_TINY: &'static str =
+    include_str!("../data/opensubtitles/zh-tiny.txt");
+pub const SUBTITLE_ZH_TEENY: &'static str =
+    include_str!("../data/opensubtitles/zh-teeny.txt");
+
+pub const PATHOLOGICAL_MD5_HUGE: &'static str =
+    include_str!("../data/pathological/md5-huge.txt");
+pub const PATHOLOGICAL_RANDOM_HUGE: &'static str =
+    include_str!("../data/pathological/random-huge.txt");
+pub const PATHOLOGICAL_REPEATED_RARE_HUGE: &'static str =
+    include_str!("../data/pathological/repeated-rare-huge.txt");
+pub const PATHOLOGICAL_REPEATED_RARE_SMALL: &'static str =
+    include_str!("../data/pathological/repeated-rare-small.txt");
+pub const PATHOLOGICAL_DEFEAT_SIMPLE_VECTOR: &'static str =
+    include_str!("../data/pathological/defeat-simple-vector.txt");
+pub const PATHOLOGICAL_DEFEAT_SIMPLE_VECTOR_FREQ: &'static str =
+    include_str!("../data/pathological/defeat-simple-vector-freq.txt");
+pub const PATHOLOGICAL_DEFEAT_SIMPLE_VECTOR_REPEATED: &'static str =
+    include_str!("../data/pathological/defeat-simple-vector-repeated.txt");
+
+pub const SLICESLICE_I386: &'static str =
+    include_str!("../data/sliceslice/i386.txt");
+pub const SLICESLICE_WORDS: &'static str =
+    include_str!("../data/sliceslice/words.txt");
+
+pub const CODE_RUST_LIBRARY: &'static str =
+    include_str!("../data/code/rust-library.rs");
@@ -12,18 +12,18 @@ use crate::{
    },
 };

-#[path = "../../../src/c.rs"]
+#[path = "../../../src/memchr/c.rs"]
 mod c;
 #[allow(dead_code)]
-#[path = "../../../src/fallback.rs"]
+#[path = "../../../src/memchr/fallback.rs"]
 mod fallback;
 mod imp;
 mod inputs;
-#[path = "../../../src/naive.rs"]
+#[path = "../../../src/memchr/naive.rs"]
 mod naive;

 pub fn all(c: &mut Criterion) {
-    define_memchr_input1(c, "memchr1/rust/huge", HUGE, move |search, b| {
+    define_memchr_input1(c, "memchr1/krate/huge", HUGE, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count,
@@ -31,7 +31,7 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
-    define_memchr_input1(c, "memchr1/rust/small", SMALL, move |search, b| {
+    define_memchr_input1(c, "memchr1/krate/small", SMALL, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count,
@@ -39,7 +39,7 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
-    define_memchr_input1(c, "memchr1/rust/tiny", TINY, move |search, b| {
+    define_memchr_input1(c, "memchr1/krate/tiny", TINY, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count,
@@ -47,7 +47,7 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
-    define_memchr_input1(c, "memchr1/rust/empty", EMPTY, move |search, b| {
+    define_memchr_input1(c, "memchr1/krate/empty", EMPTY, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count,
@@ -175,7 +175,7 @@ pub fn all(c: &mut Criterion) {
        });
    });

-    define_memchr_input2(c, "memchr2/rust/huge", HUGE, move |search, b| {
+    define_memchr_input2(c, "memchr2/krate/huge", HUGE, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count + search.byte2.count,
@@ -187,7 +187,7 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
-    define_memchr_input2(c, "memchr2/rust/small", SMALL, move |search, b| {
+    define_memchr_input2(c, "memchr2/krate/small", SMALL, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count + search.byte2.count,
@@ -199,7 +199,7 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
-    define_memchr_input2(c, "memchr2/rust/tiny", TINY, move |search, b| {
+    define_memchr_input2(c, "memchr2/krate/tiny", TINY, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count + search.byte2.count,
@@ -211,7 +211,7 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
-    define_memchr_input2(c, "memchr2/rust/empty", EMPTY, move |search, b| {
+    define_memchr_input2(c, "memchr2/krate/empty", EMPTY, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count + search.byte2.count,
@@ -342,7 +342,7 @@ pub fn all(c: &mut Criterion) {
        });
    });

-    define_memchr_input3(c, "memchr3/rust/huge", HUGE, move |search, b| {
+    define_memchr_input3(c, "memchr3/krate/huge", HUGE, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count + search.byte2.count + search.byte3.count,
@@ -355,7 +355,7 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
-    define_memchr_input3(c, "memchr3/rust/small", SMALL, move |search, b| {
+    define_memchr_input3(c, "memchr3/krate/small", SMALL, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count + search.byte2.count + search.byte3.count,
@@ -368,7 +368,7 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
-    define_memchr_input3(c, "memchr3/rust/tiny", TINY, move |search, b| {
+    define_memchr_input3(c, "memchr3/krate/tiny", TINY, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count + search.byte2.count + search.byte3.count,
@@ -381,7 +381,7 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
-    define_memchr_input3(c, "memchr3/rust/empty", EMPTY, move |search, b| {
+    define_memchr_input3(c, "memchr3/krate/empty", EMPTY, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count + search.byte2.count + search.byte3.count,
@@ -529,7 +529,7 @@ pub fn all(c: &mut Criterion) {
        });
    });

-    define_memchr_input1(c, "memrchr1/rust/huge", HUGE, move |search, b| {
+    define_memchr_input1(c, "memrchr1/krate/huge", HUGE, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count,
@@ -537,23 +537,20 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
-    define_memchr_input1(c, "memrchr1/rust/small", SMALL, move |search, b| {
-        b.iter(|| {
-            assert_eq!(
-                search.byte1.count,
-                memrchr1_count(search.byte1.byte, search.corpus)
-            );
-        });
-    });
-    define_memchr_input1(c, "memrchr1/rust/tiny", TINY, move |search, b| {
-        b.iter(|| {
-            assert_eq!(
-                search.byte1.count,
-                memrchr1_count(search.byte1.byte, search.corpus)
-            );
-        });
-    });
-    define_memchr_input1(c, "memrchr1/rust/empty", EMPTY, move |search, b| {
+    define_memchr_input1(
+        c,
+        "memrchr1/krate/small",
+        SMALL,
+        move |search, b| {
+            b.iter(|| {
+                assert_eq!(
+                    search.byte1.count,
+                    memrchr1_count(search.byte1.byte, search.corpus)
+                );
+            });
+        },
+    );
+    define_memchr_input1(c, "memrchr1/krate/tiny", TINY, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count,
@@ -561,6 +558,19 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
+    define_memchr_input1(
+        c,
+        "memrchr1/krate/empty",
+        EMPTY,
+        move |search, b| {
+            b.iter(|| {
+                assert_eq!(
+                    search.byte1.count,
+                    memrchr1_count(search.byte1.byte, search.corpus)
+                );
+            });
+        },
+    );

    #[cfg(all(target_os = "linux"))]
    {
@@ -630,7 +640,7 @@ pub fn all(c: &mut Criterion) {
        );
    }

-    define_memchr_input2(c, "memrchr2/rust/huge", HUGE, move |search, b| {
+    define_memchr_input2(c, "memrchr2/krate/huge", HUGE, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count + search.byte2.count,
@@ -642,31 +652,24 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
-    define_memchr_input2(c, "memrchr2/rust/small", SMALL, move |search, b| {
-        b.iter(|| {
-            assert_eq!(
-                search.byte1.count + search.byte2.count,
-                memrchr2_count(
-                    search.byte1.byte,
-                    search.byte2.byte,
-                    search.corpus,
-                )
-            );
-        });
-    });
-    define_memchr_input2(c, "memrchr2/rust/tiny", TINY, move |search, b| {
-        b.iter(|| {
-            assert_eq!(
-                search.byte1.count + search.byte2.count,
-                memrchr2_count(
-                    search.byte1.byte,
-                    search.byte2.byte,
-                    search.corpus,
-                )
-            );
-        });
-    });
-    define_memchr_input2(c, "memrchr2/rust/empty", EMPTY, move |search, b| {
+    define_memchr_input2(
+        c,
+        "memrchr2/krate/small",
+        SMALL,
+        move |search, b| {
+            b.iter(|| {
+                assert_eq!(
+                    search.byte1.count + search.byte2.count,
+                    memrchr2_count(
+                        search.byte1.byte,
+                        search.byte2.byte,
+                        search.corpus,
+                    )
+                );
+            });
+        },
+    );
+    define_memchr_input2(c, "memrchr2/krate/tiny", TINY, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count + search.byte2.count,
@@ -678,8 +681,25 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
+    define_memchr_input2(
+        c,
+        "memrchr2/krate/empty",
+        EMPTY,
+        move |search, b| {
+            b.iter(|| {
+                assert_eq!(
+                    search.byte1.count + search.byte2.count,
+                    memrchr2_count(
+                        search.byte1.byte,
+                        search.byte2.byte,
+                        search.corpus,
+                    )
+                );
+            });
+        },
+    );

-    define_memchr_input3(c, "memrchr3/rust/huge", HUGE, move |search, b| {
+    define_memchr_input3(c, "memrchr3/krate/huge", HUGE, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count + search.byte2.count + search.byte3.count,
@@ -692,33 +712,27 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
-    define_memchr_input3(c, "memrchr3/rust/small", SMALL, move |search, b| {
-        b.iter(|| {
-            assert_eq!(
-                search.byte1.count + search.byte2.count + search.byte3.count,
-                memrchr3_count(
-                    search.byte1.byte,
-                    search.byte2.byte,
-                    search.byte3.byte,
-                    search.corpus,
-                )
-            );
-        });
-    });
-    define_memchr_input3(c, "memrchr3/rust/tiny", TINY, move |search, b| {
-        b.iter(|| {
-            assert_eq!(
-                search.byte1.count + search.byte2.count + search.byte3.count,
-                memrchr3_count(
-                    search.byte1.byte,
-                    search.byte2.byte,
-                    search.byte3.byte,
-                    search.corpus,
-                )
-            );
-        });
-    });
-    define_memchr_input3(c, "memrchr3/rust/empty", EMPTY, move |search, b| {
+    define_memchr_input3(
+        c,
+        "memrchr3/krate/small",
+        SMALL,
+        move |search, b| {
+            b.iter(|| {
+                assert_eq!(
+                    search.byte1.count
+                        + search.byte2.count
+                        + search.byte3.count,
+                    memrchr3_count(
+                        search.byte1.byte,
+                        search.byte2.byte,
+                        search.byte3.byte,
+                        search.corpus,
+                    )
+                );
+            });
+        },
+    );
+    define_memchr_input3(c, "memrchr3/krate/tiny", TINY, move |search, b| {
        b.iter(|| {
            assert_eq!(
                search.byte1.count + search.byte2.count + search.byte3.count,
@@ -731,6 +745,26 @@ pub fn all(c: &mut Criterion) {
            );
        });
    });
+    define_memchr_input3(
+        c,
+        "memrchr3/krate/empty",
+        EMPTY,
+        move |search, b| {
+            b.iter(|| {
+                assert_eq!(
+                    search.byte1.count
+                        + search.byte2.count
+                        + search.byte3.count,
+                    memrchr3_count(
+                        search.byte1.byte,
+                        search.byte2.byte,
+                        search.byte3.byte,
+                        search.corpus,
+                    )
+                );
+            });
+        },
+    );
 }

 fn define_memchr_input1<'i>(
@@ -739,34 +773,22 @@ fn define_memchr_input1<'i>(
    input: Input,
    bench: impl FnMut(Search1, &mut Bencher<'_>) + Clone + 'static,
 ) {
-    if let Some(search) = input.never1() {
-        let mut bench = bench.clone();
-        define(c, group, "never", input.corpus, move |b| bench(search, b));
-    }
-    if let Some(search) = input.rare1() {
-        let mut bench = bench.clone();
-        define(c, group, "rare", input.corpus, move |b| bench(search, b));
-    }
-    if let Some(search) = input.uncommon1() {
-        let mut bench = bench.clone();
-        define(c, group, "uncommon", input.corpus, move |b| bench(search, b));
-    }
-    if let Some(search) = input.common1() {
-        let mut bench = bench.clone();
-        define(c, group, "common", input.corpus, move |b| bench(search, b));
-    }
-    if let Some(search) = input.verycommon1() {
-        let mut bench = bench.clone();
-        define(c, group, "verycommon", input.corpus, move |b| {
-            bench(search, b)
-        });
-    }
-    if let Some(search) = input.supercommon1() {
-        let mut bench = bench.clone();
-        define(c, group, "supercommon", input.corpus, move |b| {
-            bench(search, b)
-        });
+    macro_rules! def {
+        ($name:expr, $kind:ident) => {
+            if let Some(search) = input.$kind() {
+                let corp = input.corpus;
+                let name = format!("{}/{}", group, $name);
+                let mut bench = bench.clone();
+                define(c, &name, corp, Box::new(move |b| bench(search, b)));
+            }
+        };
    }
+    def!("never", never1);
+    def!("rare", rare1);
+    def!("uncommon", uncommon1);
+    def!("common", common1);
+    def!("verycommon", verycommon1);
+    def!("supercommon", supercommon1);
 }
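
The `def!` macro above takes a method name as an `ident` fragment and calls it on `input`, which is what lets six near-identical `if let` blocks collapse into six one-line invocations. A minimal sketch of that pattern follows. The `Input` type, its two methods, and `defined_names` are hypothetical stand-ins for illustration, not the benchmark suite's own definitions.

```rust
// Hypothetical stand-in for the benchmark input type. Each method returns
// `Some(..)` only when that kind of search makes sense for the corpus.
struct Input {
    corpus: &'static str,
}

impl Input {
    // 0xFF cannot appear in valid UTF-8, so it is always usable as a
    // "never occurs" byte.
    fn never1(&self) -> Option<u8> {
        Some(0xFF)
    }

    // The first byte of the corpus, or `None` for an empty corpus.
    fn common1(&self) -> Option<u8> {
        self.corpus.bytes().next()
    }
}

// Collects a "group/name" label for every kind the input supports. The
// local macro receives the method name as an identifier and splices it
// into a method call, skipping kinds that return `None`.
fn defined_names(group: &str, input: &Input) -> Vec<String> {
    let mut names = Vec::new();
    macro_rules! def {
        ($name:expr, $kind:ident) => {
            if let Some(_byte) = input.$kind() {
                names.push(format!("{}/{}", group, $name));
            }
        };
    }
    def!("never", never1);
    def!("common", common1);
    names
}
```

Because the macro is defined inside the function, it can refer to the surrounding locals (`input`, `group`, `names`) directly, which a free function could only do by taking them as parameters.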

 fn define_memchr_input2<'i>(
@@ -775,34 +797,22 @@ fn define_memchr_input2<'i>(
    input: Input,
    bench: impl FnMut(Search2, &mut Bencher<'_>) + Clone + 'static,
 ) {
-    if let Some(search) = input.never2() {
-        let mut bench = bench.clone();
-        define(c, group, "never", input.corpus, move |b| bench(search, b));
-    }
-    if let Some(search) = input.rare2() {
-        let mut bench = bench.clone();
-        define(c, group, "rare", input.corpus, move |b| bench(search, b));
-    }
-    if let Some(search) = input.uncommon2() {
-        let mut bench = bench.clone();
-        define(c, group, "uncommon", input.corpus, move |b| bench(search, b));
-    }
-    if let Some(search) = input.common2() {
-        let mut bench = bench.clone();
-        define(c, group, "common", input.corpus, move |b| bench(search, b));
-    }
-    if let Some(search) = input.verycommon2() {
-        let mut bench = bench.clone();
-        define(c, group, "verycommon", input.corpus, move |b| {
-            bench(search, b)
-        });
-    }
-    if let Some(search) = input.supercommon2() {
-        let mut bench = bench.clone();
-        define(c, group, "supercommon", input.corpus, move |b| {
-            bench(search, b)
-        });
+    macro_rules! def {
+        ($name:expr, $kind:ident) => {
+            if let Some(search) = input.$kind() {
+                let corp = input.corpus;
+                let name = format!("{}/{}", group, $name);
+                let mut bench = bench.clone();
+                define(c, &name, corp, Box::new(move |b| bench(search, b)));
+            }
+        };
    }
+    def!("never", never2);
+    def!("rare", rare2);
+    def!("uncommon", uncommon2);
+    def!("common", common2);
+    def!("verycommon", verycommon2);
+    def!("supercommon", supercommon2);
 }

 fn define_memchr_input3<'i>(
@@ -811,32 +821,20 @@ fn define_memchr_input3<'i>(
    input: Input,
    bench: impl FnMut(Search3, &mut Bencher<'_>) + Clone + 'static,
 ) {
-    if let Some(search) = input.never3() {
-        let mut bench = bench.clone();
-        define(c, group, "never", input.corpus, move |b| bench(search, b));
-    }
-    if let Some(search) = input.rare3() {
-        let mut bench = bench.clone();
-        define(c, group, "rare", input.corpus, move |b| bench(search, b));
-    }
-    if let Some(search) = input.uncommon3() {
-        let mut bench = bench.clone();
-        define(c, group, "uncommon", input.corpus, move |b| bench(search, b));
-    }
-    if let Some(search) = input.common3() {
-        let mut bench = bench.clone();
-        define(c, group, "common", input.corpus, move |b| bench(search, b));
-    }
-    if let Some(search) = input.verycommon3() {
-        let mut bench = bench.clone();
-        define(c, group, "verycommon", input.corpus, move |b| {
-            bench(search, b)
-        });
-    }
-    if let Some(search) = input.supercommon3() {
-        let mut bench = bench.clone();
-        define(c, group, "supercommon", input.corpus, move |b| {
-            bench(search, b)
-        });
+    macro_rules! def {
+        ($name:expr, $kind:ident) => {
+            if let Some(search) = input.$kind() {
+                let corp = input.corpus;
+                let name = format!("{}/{}", group, $name);
+                let mut bench = bench.clone();
+                define(c, &name, corp, Box::new(move |b| bench(search, b)));
+            }
+        };
    }
+    def!("never", never3);
+    def!("rare", rare3);
+    def!("uncommon", uncommon3);
+    def!("common", common3);
+    def!("verycommon", verycommon3);
+    def!("supercommon", supercommon3);
 }
@@ -0,0 +1,833 @@
+/*
+This module defines a common API (by convention) for all of the different
+impls that we benchmark. The intent here is to 1) make it easy to write macros
+for generating benchmark definitions generic over impls and 2) make it easier
+to read the benchmarks themselves and grok how exactly each of the impls are
+being invoked.
+
+The naming scheme of each function follows the pertinent parts of our benchmark
+naming scheme (see parent module docs). Namely, it is
+
+  {impl}/{fwd|rev}/{config}
+
+Where 'impl' is the underlying implementation and 'config' is the manner of
+search. The slash indicates a module boundary. We use modules for this because
+it makes writing macros to define benchmarks for all variants much easier.
+*/
+
+/// memchr's implementation of memmem. This is the implementation that we hope
+/// does approximately as well as all other implementations, and a lot better
+/// in at least some cases.
+pub(crate) mod krate {
+    pub(crate) fn available(_: &str) -> &'static [&'static str] {
+        &["reverse", "oneshot", "prebuilt", "oneshotiter", "prebuiltiter"]
+    }
+
+    pub(crate) mod fwd {
+        use memchr::memmem;
+
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            memmem::find(haystack.as_bytes(), needle.as_bytes()).is_some()
+        }
+
+        pub(crate) fn prebuilt(
+            needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            let finder = memmem::Finder::new(needle).into_owned();
+            move |h| finder.find(h.as_bytes()).is_some()
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            haystack: &'a str,
+            needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'a {
+            memmem::find_iter(haystack.as_bytes(), needle.as_bytes())
+        }
+
+        pub(crate) fn prebuiltiter(needle: &str) -> PrebuiltIter {
+            PrebuiltIter(memmem::Finder::new(needle).into_owned())
+        }
+
+        #[derive(Debug)]
+        pub(crate) struct PrebuiltIter(memmem::Finder<'static>);
+
+        impl PrebuiltIter {
+            pub(crate) fn iter<'a>(
+                &'a self,
+                haystack: &'a str,
+            ) -> impl Iterator<Item = usize> + 'a {
+                self.0.find_iter(haystack.as_bytes())
+            }
+        }
+    }
+
+    pub(crate) mod rev {
+        use memchr::memmem;
+
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            memmem::rfind(haystack.as_bytes(), needle.as_bytes()).is_some()
+        }
+
+        pub(crate) fn prebuilt(
+            needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            let finder = memmem::FinderRev::new(needle).into_owned();
+            move |h| finder.rfind(h.as_bytes()).is_some()
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            haystack: &'a str,
+            needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'a {
+            memmem::rfind_iter(haystack.as_bytes(), needle.as_bytes())
+        }
+
+        pub(crate) fn prebuiltiter(needle: &str) -> PrebuiltIter {
+            PrebuiltIter(memmem::FinderRev::new(needle).into_owned())
+        }
+
+        #[derive(Debug)]
+        pub(crate) struct PrebuiltIter(memmem::FinderRev<'static>);
+
+        impl PrebuiltIter {
+            pub(crate) fn iter<'a>(
+                &'a self,
+                haystack: &'a str,
+            ) -> impl Iterator<Item = usize> + 'a {
+                self.0.rfind_iter(haystack.as_bytes())
+            }
+        }
+    }
+}
+
+/// memchr's implementation of memmem, but without prefilters enabled. This
+/// exists because sometimes prefilters aren't the right choice, and it's good
+/// to be able to compare it against prefilter-accelerated searches to see
+/// where this might be faster.
+pub(crate) mod krate_nopre {
+    pub(crate) fn available(_: &str) -> &'static [&'static str] {
+        &["reverse", "oneshot", "prebuilt", "oneshotiter", "prebuiltiter"]
+    }
+
+    pub(crate) mod fwd {
+        use memchr::memmem;
+
+        fn finder(needle: &[u8]) -> memmem::Finder<'_> {
+            memmem::FinderBuilder::new()
+                .prefilter(memmem::Prefilter::None)
+                .build_forward(needle)
+        }
+
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            finder(needle.as_bytes()).find(haystack.as_bytes()).is_some()
+        }
+
+        pub(crate) fn prebuilt(
+            needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            let finder = finder(needle.as_bytes()).into_owned();
+            move |h| finder.find(h.as_bytes()).is_some()
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            haystack: &'a str,
+            needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'a {
+            super::super::iter_from_find(
+                haystack.as_bytes(),
+                needle.as_bytes(),
+                |h, n| finder(n).find(h),
+            )
+        }
+
+        pub(crate) fn prebuiltiter(needle: &str) -> PrebuiltIter {
+            PrebuiltIter(finder(needle.as_bytes()).into_owned())
+        }
+
+        #[derive(Debug)]
+        pub(crate) struct PrebuiltIter(memmem::Finder<'static>);
+
+        impl PrebuiltIter {
+            pub(crate) fn iter<'a>(
+                &'a self,
+                haystack: &'a str,
+            ) -> impl Iterator<Item = usize> + 'a {
+                self.0.find_iter(haystack.as_bytes())
+            }
+        }
+    }
+
+    // N.B. memrmem/krate_nopre and memrmem/krate should be equivalent for now
+    // since reverse searching doesn't have any prefilter support.
+    pub(crate) mod rev {
+        use memchr::memmem;
+
+        fn finder(needle: &[u8]) -> memmem::FinderRev<'_> {
+            memmem::FinderBuilder::new()
+                .prefilter(memmem::Prefilter::None)
+                .build_reverse(needle)
+        }
+
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            finder(needle.as_bytes()).rfind(haystack.as_bytes()).is_some()
+        }
+
+        pub(crate) fn prebuilt(
+            needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            let finder = finder(needle.as_bytes()).into_owned();
+            move |h| finder.rfind(h.as_bytes()).is_some()
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            haystack: &'a str,
+            needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'a {
+            super::super::iter_from_rfind(
+                haystack.as_bytes(),
+                needle.as_bytes(),
+                |h, n| finder(n).rfind(h),
+            )
+        }
+
+        pub(crate) fn prebuiltiter(needle: &str) -> PrebuiltIter {
+            PrebuiltIter(finder(needle.as_bytes()).into_owned())
+        }
+
+        #[derive(Debug)]
+        pub(crate) struct PrebuiltIter(memmem::FinderRev<'static>);
+
+        impl PrebuiltIter {
+            pub(crate) fn iter<'a>(
+                &'a self,
+                haystack: &'a str,
+            ) -> impl Iterator<Item = usize> + 'a {
+                self.0.rfind_iter(haystack.as_bytes())
+            }
+        }
+    }
+}
+
+/// bstr's implementation of memmem.
+///
+/// The implementation in this crate was originally copied from bstr.
+/// Eventually, bstr will just use the implementation in this crate, but at time
+/// of writing, it was useful to benchmark against the "original" version.
+pub(crate) mod bstr {
+    pub(crate) fn available(_: &str) -> &'static [&'static str] {
+        &["reverse", "oneshot", "prebuilt", "oneshotiter", "prebuiltiter"]
+    }
+
+    pub(crate) mod fwd {
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            bstr::ByteSlice::find(haystack.as_bytes(), needle.as_bytes())
+                .is_some()
+        }
+
+        pub(crate) fn prebuilt(
+            needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            let finder = bstr::Finder::new(needle).into_owned();
+            move |h| finder.find(h.as_bytes()).is_some()
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            haystack: &'a str,
+            needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'a {
+            bstr::ByteSlice::find_iter(haystack.as_bytes(), needle.as_bytes())
+        }
+
+        pub(crate) fn prebuiltiter(needle: &str) -> PrebuiltIter {
+            PrebuiltIter(bstr::Finder::new(needle).into_owned())
+        }
+
+        #[derive(Debug)]
+        pub(crate) struct PrebuiltIter(bstr::Finder<'static>);
+
+        impl PrebuiltIter {
+            pub(crate) fn iter<'a>(
+                &'a self,
+                haystack: &'a str,
+            ) -> impl Iterator<Item = usize> + 'a {
+                super::super::iter_from_find(
+                    haystack.as_bytes(),
+                    self.0.needle(),
+                    move |h, _| self.0.find(h),
+                )
+            }
+        }
+    }
+
+    pub(crate) mod rev {
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            bstr::ByteSlice::rfind(haystack.as_bytes(), needle.as_bytes())
+                .is_some()
+        }
+
+        pub(crate) fn prebuilt(
+            needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            let finder = bstr::FinderReverse::new(needle).into_owned();
+            move |h| finder.rfind(h.as_bytes()).is_some()
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            haystack: &'a str,
+            needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'a {
+            bstr::ByteSlice::rfind_iter(haystack.as_bytes(), needle.as_bytes())
+        }
+
+        pub(crate) fn prebuiltiter(needle: &str) -> PrebuiltIter {
+            PrebuiltIter(bstr::FinderReverse::new(needle).into_owned())
+        }
+
+        #[derive(Debug)]
+        pub(crate) struct PrebuiltIter(bstr::FinderReverse<'static>);
+
+        impl PrebuiltIter {
+            pub(crate) fn iter<'a>(
+                &'a self,
+                haystack: &'a str,
+            ) -> impl Iterator<Item = usize> + 'a {
+                super::super::iter_from_rfind(
+                    haystack.as_bytes(),
+                    self.0.needle(),
+                    move |h, _| self.0.rfind(h),
+                )
+            }
+        }
+    }
+}
+
+/// regex's implementation of substring search.
+///
+/// regex is where the concept of using heuristics based on an a priori
+/// assumption of byte frequency originated. Eventually, regex will just use the
+/// implementation in this crate, but it will still be useful to benchmark since
+/// regex tends to have higher latency. It would be good to measure that.
+///
+/// For regex, we don't provide oneshots, since that requires compiling the
+/// regex which we know is going to be ridiculously slow. No real need to
+/// measure it I think.
+pub(crate) mod regex {
+    pub(crate) fn available(_: &str) -> &'static [&'static str] {
+        &["prebuilt", "prebuiltiter"]
+    }
+
+    pub(crate) mod fwd {
+        pub(crate) fn oneshot(_haystack: &str, _needle: &str) -> bool {
+            unimplemented!("regex does not support oneshot searches")
+        }
+
+        pub(crate) fn prebuilt(
+            needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            let finder = regex::Regex::new(&regex::escape(needle)).unwrap();
+            move |h| finder.is_match(h)
+        }
+
+        pub(crate) fn oneshotiter(
+            _haystack: &str,
+            _needle: &str,
+        ) -> impl Iterator<Item = usize> + 'static {
+            std::iter::from_fn(move || {
+                unimplemented!("regex does not support oneshot searches")
+            })
+        }
+
+        pub(crate) fn prebuiltiter(needle: &str) -> PrebuiltIter {
+            PrebuiltIter(regex::Regex::new(&regex::escape(needle)).unwrap())
+        }
+
+        #[derive(Debug)]
+        pub(crate) struct PrebuiltIter(regex::Regex);
+
+        impl PrebuiltIter {
+            pub(crate) fn iter<'a>(
+                &'a self,
+                haystack: &'a str,
+            ) -> impl Iterator<Item = usize> + 'a {
+                self.0.find_iter(haystack).map(|m| m.start())
+            }
+        }
+    }
+
+    pub(crate) mod rev {
+        pub(crate) fn oneshot(_haystack: &str, _needle: &str) -> bool {
+            unimplemented!("regex does not support reverse searches")
+        }
+
+        pub(crate) fn prebuilt(
+            _needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            |_| unimplemented!("regex does not support reverse searches")
+        }
+
+        pub(crate) fn oneshotiter(
+            _haystack: &str,
+            _needle: &str,
+        ) -> impl Iterator<Item = usize> + 'static {
+            std::iter::from_fn(move || {
+                unimplemented!("regex does not support reverse searches")
+            })
+        }
+
+        pub(crate) fn prebuiltiter(_needle: &str) -> super::super::NoIter {
+            unimplemented!("regex does not support reverse searches")
+        }
+    }
+}
+
+/// std's substring search implementation.
+///
+/// std uses Two-Way like this crate, but doesn't have any prefilter
+/// heuristics.
+///
+/// std doesn't have any way to amortize the construction of the searcher, so
+/// we can't implement any of the prebuilt routines.
+pub(crate) mod stud {
+    pub(crate) fn available(_: &str) -> &'static [&'static str] {
+        &["reverse", "oneshot", "oneshotiter"]
+    }
+
+    pub(crate) mod fwd {
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            haystack.contains(needle)
+        }
+
+        pub(crate) fn prebuilt(
+            _needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            |_| unimplemented!("std does not support prebuilt searches")
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            haystack: &'a str,
+            needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'a {
+            haystack.match_indices(needle).map(|(i, _)| i)
+        }
+
+        pub(crate) fn prebuiltiter(_needle: &str) -> super::super::NoIter {
+            super::super::NoIter { imp: "std" }
+        }
+    }
+
+    pub(crate) mod rev {
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            haystack.rfind(needle).is_some()
+        }
+
+        pub(crate) fn prebuilt(
+            _needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            |_| unimplemented!("std does not support prebuilt searches")
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            haystack: &'a str,
+            needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'a {
+            haystack.rmatch_indices(needle).map(|(i, _)| i)
+        }
+
+        pub(crate) fn prebuiltiter(_needle: &str) -> super::super::NoIter {
+            super::super::NoIter { imp: "std" }
+        }
+    }
+}
+
+/// Substring search from the twoway crate.
+///
+/// twoway uses, obviously, Two-Way as an implementation. AIUI, it was taken
+/// from std at some point but heavily modified to support a prefilter via
+/// PCMPESTRI from the SSE 4.2 ISA extension. (And also uses memchr for
+/// single-byte needles.)
+///
+/// Like std, there is no way to amortize the construction of the searcher, so
+/// we can't implement any of the prebuilt routines.
+pub(crate) mod twoway {
+    pub(crate) fn available(_: &str) -> &'static [&'static str] {
+        &["reverse", "oneshot", "oneshotiter"]
+    }
+
+    pub(crate) mod fwd {
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            twoway::find_bytes(haystack.as_bytes(), needle.as_bytes())
+                .is_some()
+        }
+
+        pub(crate) fn prebuilt(
+            _needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            |_| unimplemented!("twoway does not support prebuilt searches")
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            haystack: &'a str,
+            needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'a {
+            super::super::iter_from_find(
+                haystack.as_bytes(),
+                needle.as_bytes(),
+                twoway::find_bytes,
+            )
+        }
+
+        pub(crate) fn prebuiltiter(_needle: &str) -> super::super::NoIter {
+            super::super::NoIter { imp: "twoway" }
+        }
+    }
+
+    pub(crate) mod rev {
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            twoway::rfind_bytes(haystack.as_bytes(), needle.as_bytes())
+                .is_some()
+        }
+
+        pub(crate) fn prebuilt(
+            _needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            |_| unimplemented!("twoway does not support prebuilt searches")
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            haystack: &'a str,
+            needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'a {
+            super::super::iter_from_rfind(
+                haystack.as_bytes(),
+                needle.as_bytes(),
+                twoway::rfind_bytes,
+            )
+        }
+
+        pub(crate) fn prebuiltiter(_needle: &str) -> super::super::NoIter {
+            super::super::NoIter { imp: "twoway" }
+        }
+    }
+}
+
+/// Substring search from the sliceslice crate.
+///
+/// This crate is what inspired me to write a vectorized memmem implementation
+/// in the memchr crate in the first place. In particular, it exposed some
+/// serious weaknesses in my implementation in the bstr crate.
+///
+/// sliceslice doesn't actually do anything "new" other
+/// than bringing a long-known SIMD algorithm to Rust:
+/// http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd
+///
+/// The main thrust of the algorithm is that it picks a couple of bytes in the
+/// needle and uses SIMD to check whether those two bytes occur in the haystack
+/// in a way that could lead to a match. If so, then you do a simple memcmp
+/// confirmation step. The main problem with this algorithm is that its worst
+/// case is multiplicative: that confirmatory step can become quite costly if
+/// the SIMD prefilter isn't effective. The elegance of this method, however,
+/// is that the prefilter is routinely effective.
+///
+/// The essence of memchr's implementation of memmem comes from sliceslice,
+/// but also from regex's original idea to use heuristics based on an a priori
+/// assumption of relative byte frequency AND from bstr's desire to have a
+/// constant space and worst case O(m+n) substring search. My claim is that
+/// it is the best of all worlds, and that's why this benchmark suite is so
+/// comprehensive. There are a lot of cases and implementations to test.
+///
+/// NOTE: The API of sliceslice is quite constrained. My guess is that it was
+/// designed for a very specific use case, and the API is heavily constrained
+/// to that use case (whatever it is). While its API doesn't provide any
+/// oneshot routines, we emulate them. (Its main problem is that every such
+/// search requires copying the needle into a fresh allocation. The memchr
+/// crate avoids that problem by being generic over the needle: it can be owned
+/// or borrowed.) Also, since the API only enables testing whether a substring
+/// exists or not, we can't benchmark iteration.
+///
+/// NOTE: sliceslice only works on x86_64 CPUs with AVX2 enabled. So not only
+/// do we conditionally compile the routines below, but we only run these
+/// benchmarks when AVX2 is available.
+#[cfg(target_arch = "x86_64")]
+pub(crate) mod sliceslice {
+    pub(crate) fn available(needle: &str) -> &'static [&'static str] {
+        // Apparently sliceslice doesn't support searching with an empty
+        // needle. Sheesh.
+        if !needle.is_empty() && is_x86_feature_detected!("avx2") {
+            &["oneshot", "prebuilt"]
+        } else {
+            &[]
+        }
+    }
+
+    pub(crate) mod fwd {
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            prebuilt(needle)(haystack)
+        }
+
+        pub(crate) fn prebuilt(
+            needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            if !is_x86_feature_detected!("avx2") {
+                unreachable!("sliceslice cannot be called without avx2");
+            }
+            let needle = needle.as_bytes().to_owned().into_boxed_slice();
+            // SAFETY: This code path is only entered when AVX2 is enabled,
+            // which is the only requirement for using DynamicAvx2Searcher.
+            unsafe {
+                let finder = sliceslice::x86::DynamicAvx2Searcher::new(needle);
+                move |h| finder.search_in(h.as_bytes())
+            }
+        }
+
+        pub(crate) fn oneshotiter(
+            _haystack: &str,
+            _needle: &str,
+        ) -> impl Iterator<Item = usize> + 'static {
+            std::iter::from_fn(move || {
+                unimplemented!("sliceslice does not support iteration")
+            })
+        }
+
+        pub(crate) fn prebuiltiter(_needle: &str) -> super::super::NoIter {
+            unimplemented!("sliceslice doesn't support prebuilt iteration")
+        }
+    }
+
+    pub(crate) mod rev {
+        pub(crate) fn oneshot(_haystack: &str, _needle: &str) -> bool {
+            unimplemented!("sliceslice does not support reverse searches")
+        }
+
+        pub(crate) fn prebuilt(
+            _needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            |_| unimplemented!("sliceslice does not support reverse searches")
+        }
+
+        pub(crate) fn oneshotiter(
+            _haystack: &str,
+            _needle: &str,
+        ) -> impl Iterator<Item = usize> + 'static {
+            std::iter::from_fn(move || {
+                unimplemented!("sliceslice does not support reverse searches")
+            })
+        }
+
+        pub(crate) fn prebuiltiter(_needle: &str) -> super::super::NoIter {
+            unimplemented!("sliceslice does not support reverse searches")
+        }
+    }
+}
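
The doc comment on the sliceslice module describes a prefilter that checks a couple of needle bytes before a memcmp-style confirmation. A scalar sketch of that idea follows, with no SIMD involved. The function name is hypothetical, and using the first and last needle bytes as the two probe bytes is an assumption made for illustration. It is not sliceslice's or memchr's actual code.

```rust
/// Scalar sketch of "check two bytes, then confirm". Returns the offset of
/// the first occurrence of `needle` in `haystack`, if any.
fn find_two_byte_filter(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    if needle.len() > haystack.len() {
        return None;
    }
    // Assumption: probe with the first and last bytes of the needle.
    let last_off = needle.len() - 1;
    let first = needle[0];
    let last = needle[last_off];
    for start in 0..=(haystack.len() - needle.len()) {
        // Cheap prefilter: both probe bytes must line up at this offset.
        if haystack[start] != first || haystack[start + last_off] != last {
            continue;
        }
        // Confirmation step: compare the full candidate window.
        if &haystack[start..start + needle.len()] == needle {
            return Some(start);
        }
    }
    None
}
```

The multiplicative worst case mentioned in the doc comment is visible here: when the probe bytes match at nearly every offset but the full window does not, each position pays for a full comparison.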
+
+#[cfg(not(target_arch = "x86_64"))]
+pub(crate) mod sliceslice {
+    pub(crate) fn available(_: &str) -> &'static [&'static str] {
+        &[]
+    }
+
+    pub(crate) mod fwd {
+        pub(crate) fn oneshot(_: &str, _: &str) -> bool {
+            unimplemented!("sliceslice only runs on x86")
+        }
+
+        pub(crate) fn prebuilt(_: &str) -> impl Fn(&str) -> bool + 'static {
+            unimplemented!("sliceslice only runs on x86")
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            haystack: &'a str,
+            needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'static {
+            std::iter::from_fn(move || {
+                unimplemented!("sliceslice only runs on x86")
+            })
+        }
+
+        pub(crate) fn prebuiltiter(needle: &str) -> super::super::NoIter {
+            unimplemented!("sliceslice only runs on x86")
+        }
+    }
+
+    pub(crate) mod rev {
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            unimplemented!("sliceslice does not support reverse searches")
+        }
+
+        pub(crate) fn prebuilt(
+            needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            |_| unimplemented!("sliceslice does not support reverse searches")
+        }
+
+        pub(crate) fn oneshotiter(
+            haystack: &str,
+            needle: &str,
+        ) -> impl Iterator<Item = usize> + 'static {
+            std::iter::from_fn(move || {
+                unimplemented!("sliceslice does not support reverse searches")
+            })
+        }
+
+        pub(crate) fn prebuiltiter(needle: &str) -> super::super::NoIter {
+            unimplemented!("sliceslice does not support reverse searches")
+        }
+    }
+}
+
+/// libc's substring search implementation.
+///
+/// libc doesn't have any way to amortize the construction of the searcher, so
+/// we can't implement any of the prebuilt routines.
+pub(crate) mod libc {
+    pub(crate) fn available(_: &str) -> &'static [&'static str] {
+        &["oneshot", "oneshotiter"]
+    }
+
+    pub(crate) mod fwd {
+        fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
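+            // SAFETY: Both pointers come from live slices and are passed
+            // together with the lengths of those same slices.
+            let p = unsafe {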
+                libc::memmem(
+                    haystack.as_ptr() as *const libc::c_void,
+                    haystack.len(),
+                    needle.as_ptr() as *const libc::c_void,
+                    needle.len(),
+                )
+            };
+            if p.is_null() {
+                None
+            } else {
+                Some(p as usize - (haystack.as_ptr() as usize))
+            }
+        }
+
+        pub(crate) fn oneshot(haystack: &str, needle: &str) -> bool {
+            find(haystack.as_bytes(), needle.as_bytes()).is_some()
+        }
+
+        pub(crate) fn prebuilt(
+            _needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            |_| unimplemented!("libc does not support prebuilt searches")
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            haystack: &'a str,
+            needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'a {
+            super::super::iter_from_find(
+                haystack.as_bytes(),
+                needle.as_bytes(),
+                find,
+            )
+        }
+
+        pub(crate) fn prebuiltiter(_needle: &str) -> super::super::NoIter {
+            super::super::NoIter { imp: "libc" }
+        }
+    }
+
+    pub(crate) mod rev {
+        pub(crate) fn oneshot(_haystack: &str, _needle: &str) -> bool {
+            unimplemented!("libc does not support reverse searches")
+        }
+
+        pub(crate) fn prebuilt(
+            _needle: &str,
+        ) -> impl Fn(&str) -> bool + 'static {
+            |_| unimplemented!("libc does not support reverse searches")
+        }
+
+        pub(crate) fn oneshotiter<'a>(
+            _haystack: &'a str,
+            _needle: &'a str,
+        ) -> impl Iterator<Item = usize> + 'a {
+            std::iter::from_fn(move || {
+                unimplemented!("libc does not support reverse searches")
+            })
+        }
+
+        pub(crate) fn prebuiltiter(_needle: &str) -> super::super::NoIter {
+            unimplemented!("libc does not support reverse searches")
+        }
+    }
+}
+
+/// An iterator that looks like a PrebuiltIter API-wise, but panics if it's
+/// called. This should be used for implementations that don't support
+/// prebuilt iteration.
+#[derive(Debug)]
+pub(crate) struct NoIter {
+    /// The name of the impl to use in the panic message in case it is invoked
+    /// by mistake. (But the benchmark harness should not invoke it, assuming
+    /// each impl's 'available' function is correct.)
+    imp: &'static str,
+}
+
+impl NoIter {
+    pub(crate) fn iter(
+        &self,
+        _: &str,
+    ) -> impl Iterator<Item = usize> + 'static {
+        let imp = self.imp;
+        std::iter::from_fn(move || {
+            unimplemented!("{} does not support prebuilt iteration", imp)
+        })
+    }
+}
+
+/// Accepts a corpus and a needle and a routine that implements substring
+/// search, and returns an iterator over all matches. This is useful for
+/// benchmarking "find all matches" for substring search implementations that
+/// don't expose a native way to do this.
+///
+/// The closure given takes two parameters: the corpus and needle, in that
+/// order.
+fn iter_from_find<'a>(
+    haystack: &'a [u8],
+    needle: &'a [u8],
+    mut find: impl FnMut(&[u8], &[u8]) -> Option<usize> + 'a,
+) -> impl Iterator<Item = usize> + 'a {
+    let mut pos = 0;
+    std::iter::from_fn(move || {
+        if pos > haystack.len() {
+            return None;
+        }
+        match find(&haystack[pos..], needle) {
+            None => None,
+            Some(i) => {
+                let found = pos + i;
+                // We always need to add at least 1, in case of an empty needle.
+                pos += i + std::cmp::max(1, needle.len());
+                Some(found)
+            }
+        }
+    })
+}
+
+/// Like iter_from_find, but for reverse searching.
+fn iter_from_rfind<'a>(
+    haystack: &'a [u8],
+    needle: &'a [u8],
+    mut rfind: impl FnMut(&[u8], &[u8]) -> Option<usize> + 'a,
+) -> impl Iterator<Item = usize> + 'a {
+    let mut pos = Some(haystack.len());
+    std::iter::from_fn(move || {
+        let end = match pos {
+            None => return None,
+            Some(end) => end,
+        };
+        match rfind(&haystack[..end], needle) {
+            None => None,
+            Some(i) => {
+                if end == i {
+                    // We always need to subtract at least 1, in case of an
+                    // empty needle.
+                    pos = end.checked_sub(1);
+                } else {
+                    pos = Some(i);
+                }
+                Some(i)
+            }
+        }
+    })
+}
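The two iterator adapters above can be exercised in isolation. The following is a minimal sketch, assuming a naive `windows`-based search as a stand-in for the real implementations under benchmark; `naive_find`, `naive_rfind`, `find_all` and `rfind_all` are hypothetical names introduced only for illustration. The adapter bodies follow the same stepping logic as `iter_from_find` and `iter_from_rfind`.

```rust
// Hypothetical stand-ins for a real substring search routine. These are
// naive O(m * n) scans, used only so the sketch is self-contained.
fn naive_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // `windows(0)` panics, and an empty needle matches at offset 0.
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn naive_rfind(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // In reverse, an empty needle matches at the end of the haystack.
        return Some(haystack.len());
    }
    haystack.windows(needle.len()).rposition(|w| w == needle)
}

// Same stepping logic as the forward adapter in the diff.
fn iter_from_find<'a>(
    haystack: &'a [u8],
    needle: &'a [u8],
    mut find: impl FnMut(&[u8], &[u8]) -> Option<usize> + 'a,
) -> impl Iterator<Item = usize> + 'a {
    let mut pos = 0;
    std::iter::from_fn(move || {
        if pos > haystack.len() {
            return None;
        }
        let i = find(&haystack[pos..], needle)?;
        let found = pos + i;
        // Advance by at least 1 so an empty needle still makes progress.
        pos += i + std::cmp::max(1, needle.len());
        Some(found)
    })
}

// Same stepping logic as the reverse adapter in the diff.
fn iter_from_rfind<'a>(
    haystack: &'a [u8],
    needle: &'a [u8],
    mut rfind: impl FnMut(&[u8], &[u8]) -> Option<usize> + 'a,
) -> impl Iterator<Item = usize> + 'a {
    let mut pos = Some(haystack.len());
    std::iter::from_fn(move || {
        let end = pos?;
        let i = rfind(&haystack[..end], needle)?;
        // A match starting at `end` can only be an empty match, so step
        // back by 1; `checked_sub` ends iteration once offset 0 is used.
        pos = if end == i { end.checked_sub(1) } else { Some(i) };
        Some(i)
    })
}

// Convenience wrappers that collect all match offsets.
fn find_all(haystack: &[u8], needle: &[u8]) -> Vec<usize> {
    iter_from_find(haystack, needle, naive_find).collect()
}

fn rfind_all(haystack: &[u8], needle: &[u8]) -> Vec<usize> {
    iter_from_rfind(haystack, needle, naive_rfind).collect()
}
```

Matches are reported without overlap, so searching `aaaa` for `aa` yields offsets 0 and 2. With an empty needle, every byte offset from 0 through the haystack length is reported once.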
@@ -0,0 +1,257 @@
+use crate::data;
+
+#[derive(Clone, Copy, Debug)]
+pub struct Input {
+    /// A name describing the corpus, used to identify it in benchmarks.
+    pub name: &'static str,
+    /// The haystack to search.
+    pub corpus: &'static str,
+    /// Queries that are expected to never occur.
+    pub never: &'static [Query],
+    /// Queries that are expected to occur rarely.
+    pub rare: &'static [Query],
+    /// Queries that are expected to be fairly common.
+    pub common: &'static [Query],
+}
+
+/// A substring search query for a particular haystack.
+#[derive(Clone, Copy, Debug)]
+pub struct Query {
+    /// A name for this query, used to identify it in benchmarks.
+    pub name: &'static str,
+    /// The needle to search for.
+    pub needle: &'static str,
+    /// The expected number of occurrences.
+    pub count: usize,
+}
+
+pub const INPUTS: &'static [Input] = &[
+    Input {
+        name: "code-rust-library",
+        corpus: data::CODE_RUST_LIBRARY,
+        never: &[
+            Query { name: "fn-strength", needle: "fn strength", count: 0 },
+            Query {
+                name: "fn-strength-paren",
+                needle: "fn strength(",
+                count: 0,
+            },
+            Query { name: "fn-quux", needle: "fn quux(", count: 0 },
+        ],
+        rare: &[
+            Query {
+                name: "fn-from-str",
+                needle: "pub fn from_str(",
+                count: 1,
+            },
+        ],
+        common: &[
+            Query { name: "fn-is-empty", needle: "fn is_empty(", count: 17 },
+            Query { name: "fn", needle: "fn", count: 2985 },
+            Query { name: "paren", needle: "(", count: 30193 },
+            Query { name: "let", needle: "let", count: 4737 },
+        ],
+    },
+    Input {
+        name: "huge-en",
+        corpus: data::SUBTITLE_EN_HUGE,
+        never: &[
+            Query { name: "john-watson", needle: "John Watson", count: 0 },
+            Query { name: "all-common-bytes", needle: "sternness", count: 0 },
+            Query { name: "some-rare-bytes", needle: "quartz", count: 0 },
+            Query { name: "two-space", needle: "  ", count: 0 },
+        ],
+        rare: &[
+            Query {
+                name: "sherlock-holmes",
+                needle: "Sherlock Holmes",
+                count: 1,
+            },
+            Query { name: "sherlock", needle: "Sherlock", count: 1 },
+            Query {
+                name: "medium-needle",
+                needle: "homer, marge, bart, lisa, maggie",
+                count: 1,
+            },
+            Query {
+                name: "long-needle",
+                needle: "I feel afraid of Mostafa\nHe is stronger and older than I am, and more experienced\nShould I turn back?\nDoc you're beginning to sound like Sherlock Holmes.",
+                count: 1,
+            },
+            Query {
+                name: "huge-needle",
+                needle: "Since we will meet anyway, then the sooner, the better\nTomorrow at 4:30 in front of the Horse-Riding Club\nNo, 4:30\nI am confused, almost lost\nAs if an invisible hand pushed me towards an unknown fate\nI needed someone by my side\nI needed someone to guide me to the path of security\nBut I had no one\nI couldn't ask my father's opinion, nor his wife's\nI felt just as lonely as I had before\nI feel afraid of Mostafa\nHe is stronger and older than I am, and more experienced\nShould I turn back?\nDoc you're beginning to sound like Sherlock Holmes.",
+                count: 1,
+            },
+        ],
+        common: &[
+            Query { name: "that", needle: "that", count: 865 },
+            Query { name: "one-space", needle: " ", count: 96606 },
+            Query { name: "you", needle: "you", count: 5009 },
+            // It would be nice to benchmark this case, although it's not
+            // terribly important. The problem is that std's substring
+            // implementation (correctly) never returns match offsets that
+            // split an encoded codepoint, whereas memmem on bytes will. So
+            // the counts differ. We could modify our harness to skip this on
+            // std, but it seems like much ado about nothing.
+            // Query { name: "empty", needle: "", count: 613655 },
+        ],
+    },
+    Input {
+        name: "huge-ru",
+        corpus: data::SUBTITLE_RU_HUGE,
+        never: &[Query {
+            name: "john-watson",
+            needle: "Джон Уотсон",
+            count: 0,
+        }],
+        rare: &[
+            Query {
+                name: "sherlock-holmes",
+                needle: "Шерлок Холмс",
+                count: 1,
+            },
+            Query { name: "sherlock", needle: "Шерлок", count: 1 },
+        ],
+        common: &[
+            Query { name: "that", needle: "что", count: 998 },
+            Query { name: "not", needle: "не", count: 3092 },
+            Query { name: "one-space", needle: " ", count: 46941 },
+        ],
+    },
+    Input {
+        name: "huge-zh",
+        corpus: data::SUBTITLE_ZH_HUGE,
+        never: &[Query {
+            name: "john-watson", needle: "约翰·沃森", count: 0
+        }],
+        rare: &[
+            Query {
+                name: "sherlock-holmes",
+                needle: "夏洛克·福尔摩斯",
+                count: 1,
+            },
+            Query { name: "sherlock", needle: "夏洛克", count: 1 },
+        ],
+        common: &[
+            Query { name: "that", needle: "那", count: 1056 },
+            Query { name: "do-not", needle: "不", count: 2751 },
+            Query { name: "one-space", needle: " ", count: 17232 },
+        ],
+    },
+    Input {
+        name: "teeny-en",
+        corpus: data::SUBTITLE_EN_TEENY,
+        never: &[
+            Query { name: "john-watson", needle: "John Watson", count: 0 },
+            Query { name: "all-common-bytes", needle: "sternness", count: 0 },
+            Query { name: "some-rare-bytes", needle: "quartz", count: 0 },
+            Query { name: "two-space", needle: "  ", count: 0 },
+        ],
+        rare: &[
+            Query {
+                name: "sherlock-holmes",
+                needle: "Sherlock Holmes",
+                count: 1,
+            },
+            Query { name: "sherlock", needle: "Sherlock", count: 1 },
+        ],
+        common: &[],
+    },
+    Input {
+        name: "teeny-ru",
+        corpus: data::SUBTITLE_RU_TEENY,
+        never: &[Query {
+            name: "john-watson",
+            needle: "Джон Уотсон",
+            count: 0,
+        }],
+        rare: &[
+            Query {
+                name: "sherlock-holmes",
+                needle: "Шерлок Холмс",
+                count: 1,
+            },
+            Query { name: "sherlock", needle: "Шерлок", count: 1 },
+        ],
+        common: &[],
+    },
+    Input {
+        name: "teeny-zh",
+        corpus: data::SUBTITLE_ZH_TEENY,
+        never: &[Query {
+            name: "john-watson", needle: "约翰·沃森", count: 0
+        }],
+        rare: &[
+            Query {
+                name: "sherlock-holmes",
+                needle: "夏洛克·福尔摩斯",
+                count: 1,
+            },
+            Query { name: "sherlock", needle: "夏洛克", count: 1 },
+        ],
+        common: &[],
+    },
+    Input {
+        name: "pathological-md5-huge",
+        corpus: data::PATHOLOGICAL_MD5_HUGE,
+        never: &[Query {
+            name: "no-hash",
+            needle: "61a1a40effcf97de24505f154a306597",
+            count: 0,
+        }],
+        rare: &[Query {
+            name: "last-hash",
+            needle: "831df319d8597f5bc793d690f08b159b",
+            count: 1,
+        }],
+        common: &[Query { name: "two-bytes", needle: "fe", count: 520 }],
+    },
+    Input {
+        name: "pathological-repeated-rare-huge",
+        corpus: data::PATHOLOGICAL_REPEATED_RARE_HUGE,
+        never: &[Query { name: "tricky", needle: "abczdef", count: 0 }],
+        rare: &[],
+        common: &[Query { name: "match", needle: "zzzzzzzzzz", count: 50010 }],
+    },
+    Input {
+        name: "pathological-repeated-rare-small",
+        corpus: data::PATHOLOGICAL_REPEATED_RARE_SMALL,
+        never: &[Query { name: "tricky", needle: "abczdef", count: 0 }],
+        rare: &[],
+        common: &[Query { name: "match", needle: "zzzzzzzzzz", count: 100 }],
+    },
+    Input {
+        name: "pathological-defeat-simple-vector",
+        corpus: data::PATHOLOGICAL_DEFEAT_SIMPLE_VECTOR,
+        never: &[],
+        rare: &[Query {
+            name: "alphabet",
+            needle: "qbz",
+            count: 1,
+        }],
+        common: &[],
+    },
+    Input {
+        name: "pathological-defeat-simple-vector-freq",
+        corpus: data::PATHOLOGICAL_DEFEAT_SIMPLE_VECTOR_FREQ,
+        never: &[],
+        rare: &[Query {
+            name: "alphabet",
+            needle: "qjaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaz",
+            count: 1,
+        }],
+        common: &[],
+    },
+    Input {
+        name: "pathological-defeat-simple-vector-repeated",
+        corpus: data::PATHOLOGICAL_DEFEAT_SIMPLE_VECTOR_REPEATED,
+        never: &[],
+        rare: &[Query {
+            name: "alphabet",
+            needle: "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzaz",
+            count: 1,
+        }],
+        common: &[],
+    },
+];
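The comment on the disabled "empty" query in the huge-en input notes that std and a byte-oriented memmem disagree on match counts for an empty needle. The following is a small sketch of that difference, assuming a naive byte scan in place of the real memmem; `count_str` and `count_bytes` are hypothetical helpers, not part of the benchmark harness.

```rust
/// Counts matches the way std does on `&str`: empty matches are only
/// reported at UTF-8 character boundaries.
fn count_str(haystack: &str, needle: &str) -> usize {
    haystack.matches(needle).count()
}

/// Counts non-overlapping matches on raw bytes with a naive scan. An empty
/// needle matches at every byte offset, including inside an encoded
/// codepoint.
fn count_bytes(haystack: &[u8], needle: &[u8]) -> usize {
    // Step by at least 1 so an empty needle still makes progress.
    let step = std::cmp::max(1, needle.len());
    let (mut pos, mut count) = (0, 0);
    while pos + needle.len() <= haystack.len() {
        if &haystack[pos..pos + needle.len()] == needle {
            count += 1;
            pos += step;
        } else {
            pos += 1;
        }
    }
    count
}
```

For a non-empty needle the two counts agree on valid UTF-8 input, which is why the remaining queries can share a single expected `count` across implementations.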
@@ -0,0 +1,383 @@
+/*
+This module defines benchmarks for the memmem family of functions.
+Benchmarking a substring algorithm is particularly difficult, especially
+when implementations (like this one, and others) use heuristics to speed up
+common cases, typically at the expense of less common cases. The job of this
+benchmark suite is to not only highlight the fast common cases, but to also put
+a spotlight on the less common or pathological cases. While some things are
+generally expected to be slower because of these heuristics, the benchmarks
+help us make sure that we don't let things get too slow.
+
+The naming scheme is as follows:
+
+  memr?mem/{impl}/{config}/{corpus}/{needle}
+
+Where {...} is a variable. Variables should never contain slashes. They are as
+follows:
+
+  impl
+    A brief name describing the implementation under test. Possible values:
+
+    krate
+      The implementation provided by this crate.
+    krate-nopre
+      The implementation provided by this crate without prefilters enabled.
+    bstr
+      The implementation provided by the bstr crate.
+      N.B. This is only applicable at time of writing, since bstr will
+      eventually just use this crate.
+    regex
+      The implementation of substring search provided by the regex crate.
+      N.B. This is only applicable at time of writing, since regex will
+      eventually just use this crate.
+    stud
+      The implementation of substring search provided by the standard
+      library. This implementation only works on valid UTF-8 by virtue of
+      how its API is exposed.
+    twoway
+      The implementation of substring search provided by the twoway crate.
+    sliceslice
+      The implementation of substring search provided by the sliceslice crate.
+    libc
+      The implementation of memmem in your friendly neighborhood libc.
+
+    Note that there is also a 'memmem' crate, but it is unmaintained and
+    appears to just be a snapshot of std's implementation at a particular
+    point in time (but exposed in a way to permit it to search arbitrary
+    bytes).
+
+  config
+    This should be a brief description of the configuration of the search. Not
+    all implementations can be benchmarked in all configurations. It depends on
+    the API they expose. Possible values:
+
+    oneshot
+      Executes a single search without pre-building a searcher. That is,
+      this measurement includes the time it takes to initialize a
+      searcher.
+    prebuilt
+      Executes a single search without measuring the time it takes to
+      build a searcher.
+    oneshotiter
+      Counts the total number of matches. This measures the time it takes to
+      build the searcher.
+    prebuiltiter
+      Counts the total number of matches. This does not measure the time it
+      takes to build the searcher.
+
+  corpus
+    A brief name describing the corpus or haystack used in the benchmark. In
+    general, we vary this with regard to size and language. Possible values:
+
+    subtitles-{en,ru,zh}
+      Text from the OpenSubtitles project, in one of English, Russian or
+      Chinese. This is the primary input meant to represent most kinds of
+      haystacks.
+    pathological-{...}
+      A haystack that has been specifically constructed to exploit a
+      pathological case in one or more substring search implementations.
+    sliceslice-words
+      The haystack is varied across words in an English dictionary. Using
+      this corpus means the benchmark is measuring performance on very small
+      haystacks. This was taken from the sliceslice crate benchmarks.
+    sliceslice-i386
+      The haystack is an Intel 80386 reference manual.
+      This was also taken from the sliceslice crate benchmarks.
+
+  needle
+    A brief name describing the needle used. Unlike other variables, there
+    isn't a strong controlled vocabulary for this parameter. The needle
+    variable is meant to be largely self-explanatory. For example, a needle
+    named "rare" probably means that the number of occurrences of the needle
+    is expected to be particularly low.
+*/
+
+use criterion::Criterion;
+
+use crate::{define, memmem::inputs::INPUTS};
+
+mod imp;
+mod inputs;
+mod sliceslice;
+
+pub fn all(c: &mut Criterion) {
+    oneshot(c);
+    prebuilt(c);
+    oneshot_iter(c);
+    prebuilt_iter(c);
+    sliceslice::all(c);
+}
+
+fn oneshot(c: &mut Criterion) {
+    macro_rules! def_impl {
+        ($inp:expr, $q:expr, $freq:expr, $impl:ident) => {
+            let config = "oneshot";
+            let available = imp::$impl::available($q.needle);
+            // We only define non-iter benchmarks when the count is <=1. Such
+            // queries are usually constructed to only appear at the end.
+            // Otherwise, for more common queries, the benchmark would be
+            // approximately duplicative with benchmarks on shorter haystacks
+            // for the implementations we benchmark.
+            if $q.count <= 1 && available.contains(&config) {
+                let expected = $q.count > 0;
+                macro_rules! define {
+                    ($dir:expr, $find:expr) => {
+                        let name = format!(
+                            "{dir}/{imp}/{config}/{inp}/{freq}-{q}",
+                            dir = $dir,
+                            imp = stringify!($impl),
+                            config = config,
+                            inp = $inp.name,
+                            freq = $freq,
+                            q = $q.name,
+                        );
+                        define(
+                            c,
+                            &name,
+                            $inp.corpus.as_bytes(),
+                            Box::new(move |b| {
+                                b.iter(|| {
+                                    assert_eq!(
+                                        expected,
+                                        $find($inp.corpus, $q.needle)
+                                    );
+                                });
+                            }),
+                        );
+                    };
+                }
+                define!("memmem", imp::$impl::fwd::oneshot);
+                if available.contains(&"reverse") {
+                    define!("memrmem", imp::$impl::rev::oneshot);
+                }
+            }
+        };
+    }
+    macro_rules! def_all_impls {
+        ($inp:expr, $q:expr, $freq:expr) => {
+            def_impl!($inp, $q, $freq, krate);
+            def_impl!($inp, $q, $freq, krate_nopre);
+            def_impl!($inp, $q, $freq, bstr);
+            def_impl!($inp, $q, $freq, regex);
+            def_impl!($inp, $q, $freq, stud);
+            def_impl!($inp, $q, $freq, twoway);
+            def_impl!($inp, $q, $freq, sliceslice);
+            def_impl!($inp, $q, $freq, libc);
+        };
+    }
+    for inp in INPUTS {
+        for q in inp.never {
+            def_all_impls!(inp, q, "never");
+        }
+        for q in inp.rare {
+            def_all_impls!(inp, q, "rare");
+        }
+        for q in inp.common {
+            def_all_impls!(inp, q, "common");
+        }
+    }
+}
+
+fn prebuilt(c: &mut Criterion) {
+    macro_rules! def_impl {
+        ($inp:expr, $q:expr, $freq:expr, $impl:ident) => {
+            let config = "prebuilt";
+            let available = imp::$impl::available($q.needle);
+            // We only define non-iter benchmarks when the count is <=1. Such
+            // queries are usually constructed to only appear at the end.
+            // Otherwise, for more common queries, the benchmark would be
+            // approximately duplicative with benchmarks on shorter haystacks
+            // for the implementations we benchmark.
+            if $q.count <= 1 && available.contains(&config) {
+                let expected = $q.count > 0;
+                macro_rules! define {
+                    ($dir:expr, $new_finder:expr) => {
+                        let name = format!(
+                            "{dir}/{imp}/{config}/{inp}/{freq}-{q}",
+                            dir = $dir,
+                            imp = stringify!($impl),
+                            config = config,
+                            inp = $inp.name,
+                            freq = $freq,
+                            q = $q.name,
+                        );
+                        define(
+                            c,
+                            &name,
+                            $inp.corpus.as_bytes(),
+                            Box::new(move |b| {
+                                let find = $new_finder($q.needle);
+                                b.iter(|| {
+                                    assert_eq!(expected, find($inp.corpus));
+                                });
+                            }),
+                        );
+                    };
+                }
+                define!("memmem", imp::$impl::fwd::prebuilt);
+                if available.contains(&"reverse") {
+                    define!("memrmem", imp::$impl::rev::prebuilt);
+                }
+            }
+        };
+    }
+    macro_rules! def_all_impls {
+        ($inp:expr, $q:expr, $freq:expr) => {
+            def_impl!($inp, $q, $freq, krate);
+            def_impl!($inp, $q, $freq, krate_nopre);
+            def_impl!($inp, $q, $freq, bstr);
+            def_impl!($inp, $q, $freq, regex);
+            def_impl!($inp, $q, $freq, stud);
+            def_impl!($inp, $q, $freq, twoway);
+            def_impl!($inp, $q, $freq, sliceslice);
+            def_impl!($inp, $q, $freq, libc);
+        };
+    }
+    for inp in INPUTS {
+        for q in inp.never {
+            def_all_impls!(inp, q, "never");
+        }
+        for q in inp.rare {
+            def_all_impls!(inp, q, "rare");
+        }
+        for q in inp.common {
+            def_all_impls!(inp, q, "common");
+        }
+    }
+}
+
+fn oneshot_iter(c: &mut Criterion) {
+    macro_rules! def_impl {
+        ($inp:expr, $q:expr, $freq:expr, $impl:ident) => {
+            let config = "oneshotiter";
+            let available = imp::$impl::available($q.needle);
+            // We only define iter benchmarks when the count is >1. Since
+            // queries with count<=1 are usually constructed such that the
+            // match appears at the end of the haystack, it doesn't make much
+            // sense to also benchmark iteration for that case. Instead, we only
+            // benchmark iteration for queries that match more frequently.
+            if $q.count > 1 && available.contains(&config) {
+                macro_rules! define {
+                    ($dir:expr, $find_iter:expr) => {
+                        let name = format!(
+                            "{dir}/{imp}/{config}/{inp}/{freq}-{q}",
+                            dir = $dir,
+                            imp = stringify!($impl),
+                            config = config,
+                            inp = $inp.name,
+                            freq = $freq,
+                            q = $q.name,
+                        );
+                        define(
+                            c,
+                            &name,
+                            $inp.corpus.as_bytes(),
+                            Box::new(move |b| {
+                                b.iter(|| {
+                                    let it =
+                                        $find_iter($inp.corpus, $q.needle);
+                                    assert_eq!($q.count, it.count());
+                                });
+                            }),
+                        );
+                    };
+                }
+                define!("memmem", imp::$impl::fwd::oneshotiter);
+                if available.contains(&"reverse") {
+                    define!("memrmem", imp::$impl::rev::oneshotiter);
+                }
+            }
+        };
+    }
+    macro_rules! def_all_impls {
+        ($inp:expr, $q:expr, $freq:expr) => {
+            def_impl!($inp, $q, $freq, krate);
+            def_impl!($inp, $q, $freq, krate_nopre);
+            def_impl!($inp, $q, $freq, bstr);
+            def_impl!($inp, $q, $freq, regex);
+            def_impl!($inp, $q, $freq, stud);
+            def_impl!($inp, $q, $freq, twoway);
+            def_impl!($inp, $q, $freq, sliceslice);
+            def_impl!($inp, $q, $freq, libc);
+        };
+    }
+    for inp in INPUTS {
+        for q in inp.never {
+            def_all_impls!(inp, q, "never");
+        }
+        for q in inp.rare {
+            def_all_impls!(inp, q, "rare");
+        }
+        for q in inp.common {
+            def_all_impls!(inp, q, "common");
+        }
+    }
+}
+
+fn prebuilt_iter(c: &mut Criterion) {
+    macro_rules! def_impl {
+        ($inp:expr, $q:expr, $freq:expr, $impl:ident) => {
+            let config = "prebuiltiter";
+            let available = imp::$impl::available($q.needle);
+            // We only define iter benchmarks when the count is >1. Since
+            // queries with count<=1 are usually constructed such that the
+            // match appears at the end of the haystack, it doesn't make much
+            // sense to also benchmark iteration for that case. Instead, we only
+            // benchmark iteration for queries that match more frequently.
+            if $q.count > 1 && available.contains(&config) {
+                macro_rules! define {
+                    ($dir:expr, $new_finder:expr) => {
+                        let name = format!(
+                            "{dir}/{imp}/{config}/{inp}/{freq}-{q}",
+                            dir = $dir,
+                            imp = stringify!($impl),
+                            config = config,
+                            inp = $inp.name,
+                            freq = $freq,
+                            q = $q.name,
+                        );
+                        define(
+                            c,
+                            &name,
+                            $inp.corpus.as_bytes(),
+                            Box::new(move |b| {
+                                let finder = $new_finder($q.needle);
+                                b.iter(|| {
+                                    let it = finder.iter($inp.corpus);
+                                    assert_eq!($q.count, it.count());
+                                });
+                            }),
+                        );
+                    };
+                }
+                define!("memmem", imp::$impl::fwd::prebuiltiter);
+                if available.contains(&"reverse") {
+                    define!("memrmem", imp::$impl::rev::prebuiltiter);
+                }
+            }
+        };
+    }
+    macro_rules! def_all_impls {
+        ($inp:expr, $q:expr, $freq:expr) => {
+            def_impl!($inp, $q, $freq, krate);
+            def_impl!($inp, $q, $freq, krate_nopre);
+            def_impl!($inp, $q, $freq, bstr);
+            def_impl!($inp, $q, $freq, regex);
+            def_impl!($inp, $q, $freq, stud);
+            def_impl!($inp, $q, $freq, twoway);
+            def_impl!($inp, $q, $freq, sliceslice);
+            def_impl!($inp, $q, $freq, libc);
+        };
+    }
+    for inp in INPUTS {
+        for q in inp.never {
+            def_all_impls!(inp, q, "never");
+        }
+        for q in inp.rare {
+            def_all_impls!(inp, q, "rare");
+        }
+        for q in inp.common {
+            def_all_impls!(inp, q, "common");
+        }
+    }
+}
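The four functions above share one naming pattern and one gating rule on the expected match count. A minimal sketch of both follows; `bench_name` and `configs_for` are hypothetical helpers written for illustration and are not part of the harness, which builds names inline with `format!` and `stringify!`.

```rust
/// Hypothetical helper: assembles a benchmark name from its components,
/// using the same format string as the macros in the diff.
fn bench_name(
    dir: &str,
    imp: &str,
    config: &str,
    inp: &str,
    freq: &str,
    q: &str,
) -> String {
    format!(
        "{dir}/{imp}/{config}/{inp}/{freq}-{q}",
        dir = dir,
        imp = imp,
        config = config,
        inp = inp,
        freq = freq,
        q = q,
    )
}

/// Hypothetical helper: which configurations get defined for a query with
/// the given expected match count. Queries with at most one match get the
/// single-search benchmarks; more frequent queries get the iterator ones.
fn configs_for(count: usize) -> [&'static str; 2] {
    if count <= 1 {
        ["oneshot", "prebuilt"]
    } else {
        ["oneshotiter", "prebuiltiter"]
    }
}
```

Under this sketch, the "john-watson" query on the "huge-en" input (count 0) would produce a name such as `memmem/krate/oneshot/huge-en/never-john-watson`, while the "fn-is-empty" query (count 17) would only appear under the iterator configurations.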
@@ -0,0 +1,227 @@
+/*
+These benchmarks were lifted almost verbatim out of the sliceslice crate. The
+reason why we have these benchmarks is because they were the primary thing that
+motivated me to write this particular memmem implementation. In particular, my
+existing substring search implementation in the bstr crate did quite poorly
+on these particular benchmarks. Moreover, while the benchmark setup is a little
+weird, these benchmarks do reflect cases that I think are somewhat common:
+
+* In the sliceslice-words/words case, the benchmark is primarily about
+  searching very short haystacks using common English words.
+* In the sliceslice-i386/words case, the benchmark is primarily about searching
+  a longer haystack with common English words.
+
+N.B. In the sliceslice crate, the benchmarks are called "short" and "long."
+Here, we call them sliceslice-words/words and sliceslice-i386/words,
+respectively. The name change was made to be consistent with the naming
+convention used for other benchmarks.
+
+The main thing that's "weird" about these benchmarks is that each iteration
+involves a lot of work. All of the other benchmarks in this crate focus on one
+specific needle with one specific haystack, and each iteration is a single
+search or iteration. But in these benchmarks, each iteration involves searching
+with many needles against potentially many haystacks. Nevertheless, these have
+proven useful targets for optimization.
+*/
+use criterion::{black_box, Criterion};
+use memchr::memmem;
+
+use crate::{data::*, define};
+
+pub fn all(c: &mut Criterion) {
+    search_short_haystack(c);
+    search_long_haystack(c);
+}
+
+fn search_short_haystack(c: &mut Criterion) {
+    let mut words = SLICESLICE_WORDS.lines().collect::<Vec<_>>();
+    words.sort_unstable_by_key(|word| word.len());
+    let words: Vec<&str> = words.iter().map(|&s| s).collect();
+
+    let needles = words.clone();
+    define(
+        c,
+        "memmem/krate/prebuilt/sliceslice-words/words",
+        &[],
+        Box::new(move |b| {
+            let searchers = needles
+                .iter()
+                .map(|needle| memmem::Finder::new(needle.as_bytes()))
+                .collect::<Vec<_>>();
+            b.iter(|| {
+                for (i, searcher) in searchers.iter().enumerate() {
+                    for haystack in &needles[i..] {
+                        black_box(
+                            searcher.find(haystack.as_bytes()).is_some(),
+                        );
+                    }
+                }
+            });
+        }),
+    );
+
+    let needles = words.clone();
+    define(
+        c,
+        "memmem/krate-nopre/prebuilt/sliceslice-words/words",
+        &[],
+        Box::new(move |b| {
+            let searchers = needles
+                .iter()
+                .map(|needle| {
+                    memmem::FinderBuilder::new()
+                        .prefilter(memmem::Prefilter::None)
+                        .build_forward(needle)
+                })
+                .collect::<Vec<_>>();
+            b.iter(|| {
+                for (i, searcher) in searchers.iter().enumerate() {
+                    for haystack in &needles[i..] {
+                        black_box(
+                            searcher.find(haystack.as_bytes()).is_some(),
+                        );
+                    }
+                }
+            });
+        }),
+    );
+
+    let needles = words.clone();
+    define(
+        c,
+        "memmem/std/prebuilt/sliceslice-words/words",
+        &[],
+        Box::new(move |b| {
+            b.iter(|| {
+                for (i, needle) in needles.iter().enumerate() {
+                    for haystack in &needles[i..] {
+                        black_box(haystack.contains(needle));
+                    }
+                }
+            });
+        }),
+    );
+
+    #[cfg(target_arch = "x86_64")]
+    {
+        use sliceslice::x86::DynamicAvx2Searcher;
+
+        let needles = words.clone();
+        define(
+            c,
+            "memmem/sliceslice/prebuilt/sliceslice-words/words",
+            &[],
+            Box::new(move |b| {
+                let searchers = needles
+                    .iter()
+                    .map(|&needle| unsafe {
+                        DynamicAvx2Searcher::new(
+                            needle.as_bytes().to_owned().into_boxed_slice(),
+                        )
+                    })
+                    .collect::<Vec<_>>();
+
+                b.iter(|| {
+                    for (i, searcher) in searchers.iter().enumerate() {
+                        for haystack in &needles[i..] {
+                            black_box(unsafe {
+                                searcher.search_in(haystack.as_bytes())
+                            });
+                        }
+                    }
+                });
+            }),
+        );
+    }
+}
+
+fn search_long_haystack(c: &mut Criterion) {
+    let words: Vec<&str> = SLICESLICE_WORDS.lines().collect();
+    let haystack = SLICESLICE_I386;
+    let needles = words.clone();
+    define(
+        c,
+        "memmem/krate/prebuilt/sliceslice-i386/words",
+        &[],
+        Box::new(move |b| {
+            let searchers = needles
+                .iter()
+                .map(|needle| memmem::Finder::new(needle.as_bytes()))
+                .collect::<Vec<_>>();
+            b.iter(|| {
+                for searcher in searchers.iter() {
+                    black_box(searcher.find(haystack.as_bytes()).is_some());
+                }
+            });
+        }),
+    );
+
+    let haystack = SLICESLICE_I386;
+    let needles = words.clone();
+    define(
+        c,
+        "memmem/krate-nopre/prebuilt/sliceslice-i386/words",
+        &[],
+        Box::new(move |b| {
+            let searchers = needles
+                .iter()
+                .map(|needle| {
+                    memmem::FinderBuilder::new()
+                        .prefilter(memmem::Prefilter::None)
+                        .build_forward(needle)
+                })
+                .collect::<Vec<_>>();
+            b.iter(|| {
+                for searcher in searchers.iter() {
+                    black_box(searcher.find(haystack.as_bytes()).is_some());
+                }
+            });
+        }),
+    );
+
+    let haystack = SLICESLICE_I386;
+    let needles = words.clone();
+    define(
+        c,
+        "memmem/std/prebuilt/sliceslice-i386/words",
+        &[],
+        Box::new(move |b| {
+            b.iter(|| {
+                for needle in needles.iter() {
+                    black_box(haystack.contains(needle));
+                }
+            });
+        }),
+    );
+
+    #[cfg(target_arch = "x86_64")]
+    {
+        use sliceslice::x86::DynamicAvx2Searcher;
+
+        let haystack = SLICESLICE_I386;
+        let needles = words.clone();
+        define(
+            c,
+            "memmem/sliceslice/prebuilt/sliceslice-i386/words",
+            &[],
+            Box::new(move |b| {
+                let searchers = needles
+                    .iter()
+                    .map(|needle| unsafe {
+                        DynamicAvx2Searcher::new(
+                            needle.as_bytes().to_owned().into_boxed_slice(),
+                        )
+                    })
+                    .collect::<Vec<_>>();
+
+                b.iter(|| {
+                    for searcher in &searchers {
+                        black_box(unsafe {
+                            searcher.search_in(haystack.as_bytes())
+                        });
+                    }
+                });
+            }),
+        );
+    }
+}
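A minimal std-only sketch of the iteration shape used by the sliceslice-words/words benchmarks above: words are sorted by length, and each word is searched for in itself and in every word that sorts after it. The helper name `count_hits` and the sample words are assumptions made for illustration. The benchmark itself discards results through `black_box` rather than counting them.

```rust
/// Hypothetical helper (not part of the benchmark suite): counts how many
/// (needle, haystack) pairs match when every word is searched for in itself
/// and in all words that sort after it by length.
fn count_hits(words: &[&str]) -> usize {
    let mut sorted = words.to_vec();
    // Stable sort by length, so shorter needles come first.
    sorted.sort_by_key(|w| w.len());

    let mut hits = 0;
    for (i, needle) in sorted.iter().enumerate() {
        // Only haystacks at least as long as the needle are visited.
        for haystack in &sorted[i..] {
            if haystack.contains(*needle) {
                hits += 1;
            }
        }
    }
    hits
}
```

With `n` words this performs `n * (n + 1) / 2` searches per call, which is why a single benchmark iteration here does far more work than in the single-needle benchmarks.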
@@ -0,0 +1,4 @@
+/target
+/corpus
+/artifacts
+/coverage
@@ -0,0 +1,37 @@
+# This file is automatically @generated by Cargo.
+# It is not intended for manual editing.
+version = 3
+
+[[package]]
+name = "arbitrary"
+version = "1.0.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "698b65a961a9d730fb45b6b0327e20207810c9f61ee421b082b27ba003f49e2b"
+
+[[package]]
+name = "cc"
+version = "1.0.67"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e3c69b077ad434294d3ce9f1f6143a2a4b89a8a2d54ef813d85003a4fd1137fd"
+
+[[package]]
+name = "libfuzzer-sys"
+version = "0.4.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "86c975d637bc2a2f99440932b731491fc34c7f785d239e38af3addd3c2fd0e46"
+dependencies = [
+ "arbitrary",
+ "cc",
+]
+
+[[package]]
+name = "memchr"
+version = "2.3.4"
+
+[[package]]
+name = "memchr-fuzz"
+version = "0.0.0"
+dependencies = [
+ "libfuzzer-sys",
+ "memchr",
+]
@@ -0,0 +1,79 @@
+cargo-features = ['named-profiles']
+
+[package]
+publish = false
+name = "memchr-fuzz"
+version = "0.0.0"
+authors = ["Andrew Gallant <jamslam@gmail.com>"]
+edition = "2018"
+
+[package.metadata]
+cargo-fuzz = true
+
+[dependencies]
+libfuzzer-sys = "0.4"
+
+[dependencies.memchr]
+path = ".."
+
+# Prevent this from interfering with workspaces
+[workspace]
+members = ["."]
+
+[[bin]]
+name = "memchr"
+path = "fuzz_targets/memchr.rs"
+test = false
+doc = false
+
+[[bin]]
+name = "memchr2"
+path = "fuzz_targets/memchr2.rs"
+test = false
+doc = false
+
+[[bin]]
+name = "memchr3"
+path = "fuzz_targets/memchr3.rs"
+test = false
+doc = false
+
+[[bin]]
+name = "memrchr"
+path = "fuzz_targets/memrchr.rs"
+test = false
+doc = false
+
+[[bin]]
+name = "memrchr2"
+path = "fuzz_targets/memrchr2.rs"
+test = false
+doc = false
+
+[[bin]]
+name = "memrchr3"
+path = "fuzz_targets/memrchr3.rs"
+test = false
+doc = false
+
+[[bin]]
+name = "memmem"
+path = "fuzz_targets/memmem.rs"
+test = false
+doc = false
+
+[[bin]]
+name = "memrmem"
+path = "fuzz_targets/memrmem.rs"
+test = false
+doc = false
+
+[profile.release]
+opt-level = 3
+debug = true
+
+[profile.debug]
+inherits = "release"
+
+[profile.test]
+inherits = "release"
@@ -0,0 +1,11 @@
+#![no_main]
+
+use libfuzzer_sys::fuzz_target;
+use memchr::memchr_iter;
+
+fuzz_target!(|data: &[u8]| {
+    if data.is_empty() {
+        return;
+    }
+    memchr_iter(data[0], &data[1..]).count();
+});
@@ -0,0 +1,11 @@
+#![no_main]
+
+use libfuzzer_sys::fuzz_target;
+use memchr::memchr2_iter;
+
+fuzz_target!(|data: &[u8]| {
+    if data.len() < 2 {
+        return;
+    }
+    memchr2_iter(data[0], data[1], &data[2..]).count();
+});
@@ -0,0 +1,11 @@
+#![no_main]
+
+use libfuzzer_sys::fuzz_target;
+use memchr::memchr3_iter;
+
+fuzz_target!(|data: &[u8]| {
+    if data.len() < 3 {
+        return;
+    }
+    memchr3_iter(data[0], data[1], data[2], &data[3..]).count();
+});
@@ -0,0 +1,13 @@
+#![no_main]
+
+use libfuzzer_sys::fuzz_target;
+use memchr::memmem;
+
+fuzz_target!(|data: &[u8]| {
+    if data.len() < 2 {
+        return;
+    }
+    let split = std::cmp::max(data[0] as usize, 1) % data.len() as usize;
+    let (needle, haystack) = (&data[..split], &data[split..]);
+    memmem::find_iter(haystack, needle).count();
+});
@@ -0,0 +1,11 @@
+#![no_main]
+
+use libfuzzer_sys::fuzz_target;
+use memchr::memchr_iter;
+
+fuzz_target!(|data: &[u8]| {
+    if data.is_empty() {
+        return;
+    }
+    memchr_iter(data[0], &data[1..]).rev().count();
+});
@@ -0,0 +1,11 @@
+#![no_main]
+
+use libfuzzer_sys::fuzz_target;
+use memchr::memchr2_iter;
+
+fuzz_target!(|data: &[u8]| {
+    if data.len() < 2 {
+        return;
+    }
+    memchr2_iter(data[0], data[1], &data[2..]).rev().count();
+});
@@ -0,0 +1,11 @@
+#![no_main]
+
+use libfuzzer_sys::fuzz_target;
+use memchr::memchr3_iter;
+
+fuzz_target!(|data: &[u8]| {
+    if data.len() < 3 {
+        return;
+    }
+    memchr3_iter(data[0], data[1], data[2], &data[3..]).rev().count();
+});
@@ -0,0 +1,13 @@
+#![no_main]
+
+use libfuzzer_sys::fuzz_target;
+use memchr::memmem;
+
+fuzz_target!(|data: &[u8]| {
+    if data.len() < 2 {
+        return;
+    }
+    let split = std::cmp::max(data[0] as usize, 1) % data.len() as usize;
+    let (needle, haystack) = (&data[..split], &data[split..]);
+    memmem::rfind_iter(haystack, needle).count();
+});
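The memmem and memrmem fuzz targets above derive both the needle and the haystack from one input buffer. The sketch below isolates that split so its edge cases are easier to see. The function name `split_input` is hypothetical and does not exist in the fuzz targets.

```rust
/// Hypothetical helper mirroring the split used by the memmem/memrmem fuzz
/// targets: the first byte selects the split point, everything before the
/// split is the needle and everything from the split onward is the haystack.
fn split_input(data: &[u8]) -> Option<(&[u8], &[u8])> {
    if data.len() < 2 {
        return None;
    }
    // The modulo keeps the split in bounds. It can also yield 0, in which
    // case the needle is empty and the haystack is the whole input.
    let split = std::cmp::max(data[0] as usize, 1) % data.len();
    Some((&data[..split], &data[split..]))
}
```

Note that the needle includes the length-selecting byte itself whenever the split is non-zero, so the first needle byte and the needle length are correlated in the inputs this scheme produces.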
@@ -0,0 +1,97 @@
+use core::ops;
+
+/// A specialized copy-on-write byte string.
+///
+/// The purpose of this type is to permit usage of a "borrowed or owned
+/// byte string" in a way that keeps std/no-std compatibility. That is, in
+/// no-std mode, this type devolves into a simple &[u8] with no owned variant
+/// available. We can't just use a plain Cow because Cow is not in core.
+#[derive(Clone, Debug)]
+pub struct CowBytes<'a>(Imp<'a>);
+
+// N.B. We don't use std::borrow::Cow here since we can get away with a
+// Box<[u8]> for our use case, which is 1/3 smaller than the Vec<u8> that
+// a Cow<[u8]> would use.
+#[cfg(feature = "std")]
+#[derive(Clone, Debug)]
+enum Imp<'a> {
+    Borrowed(&'a [u8]),
+    Owned(Box<[u8]>),
+}
+
+#[cfg(not(feature = "std"))]
+#[derive(Clone, Debug)]
+struct Imp<'a>(&'a [u8]);
+
+impl<'a> ops::Deref for CowBytes<'a> {
+    type Target = [u8];
+
+    #[inline(always)]
+    fn deref(&self) -> &[u8] {
+        self.as_slice()
+    }
+}
+
+impl<'a> CowBytes<'a> {
+    /// Create a new borrowed CowBytes.
+    #[inline(always)]
+    pub fn new<B: ?Sized + AsRef<[u8]>>(bytes: &'a B) -> CowBytes<'a> {
+        CowBytes(Imp::new(bytes.as_ref()))
+    }
+
+    /// Create a new owned CowBytes.
+    #[cfg(feature = "std")]
+    #[inline(always)]
+    pub fn new_owned(bytes: Box<[u8]>) -> CowBytes<'static> {
+        CowBytes(Imp::Owned(bytes))
+    }
+
+    /// Return a borrowed byte string, regardless of whether this is an owned
+    /// or borrowed byte string internally.
+    #[inline(always)]
+    pub fn as_slice(&self) -> &[u8] {
+        self.0.as_slice()
+    }
+
+    /// Return an owned version of this copy-on-write byte string.
+    ///
+    /// If this is already an owned byte string internally, then this is a
+    /// no-op. Otherwise, the internal byte string is copied.
+    #[cfg(feature = "std")]
+    #[inline(always)]
+    pub fn into_owned(self) -> CowBytes<'static> {
+        match self.0 {
+            Imp::Borrowed(b) => CowBytes::new_owned(Box::from(b)),
+            Imp::Owned(b) => CowBytes::new_owned(b),
+        }
+    }
+}
+
+impl<'a> Imp<'a> {
+    #[cfg(feature = "std")]
+    #[inline(always)]
+    pub fn new(bytes: &'a [u8]) -> Imp<'a> {
+        Imp::Borrowed(bytes)
+    }
+
+    #[cfg(not(feature = "std"))]
+    #[inline(always)]
+    pub fn new(bytes: &'a [u8]) -> Imp<'a> {
+        Imp(bytes)
+    }
+
+    #[cfg(feature = "std")]
+    #[inline(always)]
+    pub fn as_slice(&self) -> &[u8] {
+        match self {
+            Imp::Owned(ref x) => x,
+            Imp::Borrowed(x) => x,
+        }
+    }
+
+    #[cfg(not(feature = "std"))]
+    #[inline(always)]
+    pub fn as_slice(&self) -> &[u8] {
+        self.0
+    }
+}
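The N.B. comment in cow.rs says a `Box<[u8]>` is 1/3 smaller than the `Vec<u8>` that a `Cow<[u8]>` would hold. The sketch below, assuming a standard library is available, checks that claim and mirrors the borrowed/owned shape of the std-enabled `Imp` enum. The names `BytesSketch` and `payload_words` are illustrative and not part of the crate.

```rust
use std::mem::size_of;

/// Illustrative stand-in for the std-enabled `Imp` enum above.
#[derive(Clone, Debug)]
enum BytesSketch<'a> {
    Borrowed(&'a [u8]),
    Owned(Box<[u8]>),
}

impl<'a> BytesSketch<'a> {
    /// Borrow the bytes regardless of which variant is stored.
    fn as_slice(&self) -> &[u8] {
        match self {
            BytesSketch::Borrowed(b) => b,
            BytesSketch::Owned(b) => b,
        }
    }

    /// Copy borrowed bytes into a box; already-owned bytes are moved as-is.
    fn into_owned(self) -> BytesSketch<'static> {
        match self {
            BytesSketch::Borrowed(b) => BytesSketch::Owned(Box::from(b)),
            BytesSketch::Owned(b) => BytesSketch::Owned(b),
        }
    }
}

/// Returns the sizes of `Box<[u8]>` and `Vec<u8>`, measured in machine words.
fn payload_words() -> (usize, usize) {
    let word = size_of::<usize>();
    (size_of::<Box<[u8]>>() / word, size_of::<Vec<u8>>() / word)
}
```

A boxed slice stores a pointer and a length, while a `Vec` additionally stores a capacity, which accounts for the two-word versus three-word difference.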
@@ -1,28 +1,163 @@
 /*!
-The `memchr` crate provides heavily optimized routines for searching bytes.
+This library provides heavily optimized routines for string search primitives.

-The `memchr` function is traditionally provided by libc, however, the
-performance of `memchr` can vary significantly depending on the specific
-implementation of libc that is used. They can range from manually tuned
-Assembly implementations (like that found in GNU's libc) all the way to
-non-vectorized C implementations (like that found in MUSL).
+# Overview

-To smooth out the differences between implementations of libc, at least
-on `x86_64` for Rust 1.27+, this crate provides its own implementation of
-`memchr` that should perform competitively with the one found in GNU's libc.
-The implementation is in pure Rust and has no dependency on a C compiler or an
-Assembler.
+This section gives a brief high level overview of what this crate offers.

-Additionally, GNU libc also provides an extension, `memrchr`. This crate
-provides its own implementation of `memrchr` as well, on top of `memchr2`,
-`memchr3`, `memrchr2` and `memrchr3`. The difference between `memchr` and
-`memchr2` is that that `memchr2` permits finding all occurrences of two bytes
-instead of one. Similarly for `memchr3`.
+* The top-level module provides routines for searching for 1, 2 or 3 bytes
+  in the forward or reverse direction. When searching for more than one byte,
+  positions are considered a match if the byte at that position matches any
+  of the bytes.
+* The [`memmem`] sub-module provides forward and reverse substring search
+  routines.
+
+In all such cases, routines operate on `&[u8]` without regard to encoding. This
+is exactly what you want when searching either UTF-8 or arbitrary bytes.
+
+# Example: using `memchr`
+
+This example shows how to use `memchr` to find the first occurrence of `z` in
+a haystack:
+
+```
+use memchr::memchr;
+
+let haystack = b"foo bar baz quuz";
+assert_eq!(Some(10), memchr(b'z', haystack));
+```
+
+# Example: matching one of three possible bytes
+
+This example shows how to use `memchr3_iter` in reverse to find occurrences of
+`a`, `b` or `c`, starting at the end of the haystack.
+
+```
+use memchr::memchr3_iter;
+
+let haystack = b"xyzaxyzbxyzc";
+
+let mut it = memchr3_iter(b'a', b'b', b'c', haystack).rev();
+assert_eq!(Some(11), it.next());
+assert_eq!(Some(7), it.next());
+assert_eq!(Some(3), it.next());
+assert_eq!(None, it.next());
+```
+
+# Example: iterating over substring matches
+
+This example shows how to use the [`memmem`] sub-module to find occurrences of
+a substring in a haystack.
+
+```
+use memchr::memmem;
+
+let haystack = b"foo bar foo baz foo";
+
+let mut it = memmem::find_iter(haystack, "foo");
+assert_eq!(Some(0), it.next());
+assert_eq!(Some(8), it.next());
+assert_eq!(Some(16), it.next());
+assert_eq!(None, it.next());
+```
+
+# Example: repeating a search for the same needle
+
+It may be possible for the overhead of constructing a substring searcher to be
+measurable in some workloads. In cases where the same needle is used to search
+many haystacks, it is possible to do construction once and thus to avoid it for
+subsequent searches. This can be done with a [`memmem::Finder`]:
+
+```
+use memchr::memmem;
+
+let finder = memmem::Finder::new("foo");
+
+assert_eq!(Some(4), finder.find(b"baz foo quux"));
+assert_eq!(None, finder.find(b"quux baz bar"));
+```
+
+# Why use this crate?
+
+At first glance, the APIs provided by this crate might seem weird. Why provide
+a dedicated routine like `memchr` for something that could be implemented
+clearly and trivially in one line:
+
+```
+fn memchr(needle: u8, haystack: &[u8]) -> Option<usize> {
+    haystack.iter().position(|&b| b == needle)
+}
+```
+
+Or similarly, why does this crate provide substring search routines when Rust's
+core library already provides them?
+
+```
+fn search(haystack: &str, needle: &str) -> Option<usize> {
+    haystack.find(needle)
+}
+```
+
+The primary reason for both of them to exist is performance. When it comes to
+performance, at a high level at least, there are two primary ways to look at
+it:
+
+* **Throughput**: For this, think about it as, "given some very large haystack
+  and a byte that never occurs in that haystack, how long does it take to
+  search through it and determine that it, in fact, does not occur?"
+* **Latency**: For this, think about it as, "given a tiny haystack---just a
+  few bytes---how long does it take to determine if a byte is in it?"
+
+The `memchr` routine in this crate has _slightly_ worse latency than the
+solution presented above. However, its throughput can easily be over an
+order of magnitude faster. This is a good general purpose trade off to make.
+You rarely lose, but often gain big.
+
+**NOTE:** The name `memchr` comes from the corresponding routine in libc. A key
+advantage of using this library is that its performance is not tied to the
+quality of the `memchr` implementation in the libc you happen to be using,
+which can vary greatly from platform to platform.
+
+But what about substring search? This one is a bit more complicated. The
+primary reason for its existence is still indeed performance, but it's also
+useful because Rust's core library doesn't actually expose any substring
+search routine on arbitrary bytes. The only substring search routine that
+exists works exclusively on valid UTF-8.
+
+So if you have valid UTF-8, is there a reason to use this over the standard
+library substring search routine? Yes. This routine is faster on almost every
+metric, including latency. The natural question then, is why isn't this
+implementation in the standard library, even if only for searching on UTF-8?
+The reason is that the implementation details for using SIMD in the standard
+library haven't quite been worked out yet.
+
+**NOTE:** Currently, only `x86_64` targets have highly accelerated
+implementations of substring search. For `memchr`, all targets have
+somewhat-accelerated implementations, while only `x86_64` targets have highly
+accelerated implementations. This limitation is expected to be lifted once the
+standard library exposes a platform independent SIMD API.
+
+# Crate features
+
+* **std** - When enabled (the default), this will permit this crate to use
+  features specific to the standard library. Currently, the only thing used
+  from the standard library is runtime SIMD CPU feature detection. This means
+  that this feature must be enabled to get AVX accelerated routines. When
+  `std` is not enabled, this crate will still attempt to use SSE2 accelerated
+  routines on `x86_64`.
+* **libc** - When enabled (**not** the default), this library will use your
+  platform's libc implementation of `memchr` (and `memrchr` on Linux). This
+  can be useful on non-`x86_64` targets where the fallback implementation in
+  this crate is not as good as the one found in your libc. All other routines
+  (e.g., `memchr[23]` and substring search) unconditionally use the
+  implementation in this crate.
 */

-#![cfg_attr(not(feature = "std"), no_std)]
 #![deny(missing_docs)]
-#![doc(html_root_url = "https://docs.rs/memchr/2.0.0")]
+#![cfg_attr(not(feature = "std"), no_std)]
+// It's not worth trying to gate all code on just miri, so turn off relevant
+// dead code warnings.
+#![cfg_attr(miri, allow(dead_code, unused_macros))]

 // Supporting 8-bit (or others) would be fine. If you need it, please submit a
 // bug report at https://github.com/BurntSushi/rust-memchr
@@ -33,409 +168,14 @@ instead of one. Similarly for `memchr3`.
 )))]
 compile_error!("memchr currently not supported on non-{16,32,64}");

-use core::iter::Rev;
+pub use crate::memchr::{
+    memchr, memchr2, memchr2_iter, memchr3, memchr3_iter, memchr_iter,
+    memrchr, memrchr2, memrchr2_iter, memrchr3, memrchr3_iter, memrchr_iter,
+    Memchr, Memchr2, Memchr3,
+};

-pub use crate::iter::{Memchr, Memchr2, Memchr3};
-
-// N.B. If you're looking for the cfg knobs for libc, see build.rs.
-#[cfg(memchr_libc)]
-mod c;
-#[allow(dead_code)]
-mod fallback;
-mod iter;
-mod naive;
+mod cow;
+mod memchr;
+pub mod memmem;
 #[cfg(test)]
 mod tests;
-#[cfg(all(not(miri), target_arch = "x86_64", memchr_runtime_simd))]
-mod x86;
-
-/// An iterator over all occurrences of the needle in a haystack.
-#[inline]
-pub fn memchr_iter(needle: u8, haystack: &[u8]) -> Memchr<'_> {
-    Memchr::new(needle, haystack)
-}
-
-/// An iterator over all occurrences of the needles in a haystack.
-#[inline]
-pub fn memchr2_iter(needle1: u8, needle2: u8, haystack: &[u8]) -> Memchr2<'_> {
-    Memchr2::new(needle1, needle2, haystack)
-}
-
-/// An iterator over all occurrences of the needles in a haystack.
-#[inline]
-pub fn memchr3_iter(
-    needle1: u8,
-    needle2: u8,
-    needle3: u8,
-    haystack: &[u8],
-) -> Memchr3<'_> {
-    Memchr3::new(needle1, needle2, needle3, haystack)
-}
-
-/// An iterator over all occurrences of the needle in a haystack, in reverse.
-#[inline]
-pub fn memrchr_iter(needle: u8, haystack: &[u8]) -> Rev<Memchr<'_>> {
-    Memchr::new(needle, haystack).rev()
-}
-
-/// An iterator over all occurrences of the needles in a haystack, in reverse.
-#[inline]
-pub fn memrchr2_iter(
-    needle1: u8,
-    needle2: u8,
-    haystack: &[u8],
-) -> Rev<Memchr2<'_>> {
-    Memchr2::new(needle1, needle2, haystack).rev()
-}
-
-/// An iterator over all occurrences of the needles in a haystack, in reverse.
-#[inline]
-pub fn memrchr3_iter(
-    needle1: u8,
-    needle2: u8,
-    needle3: u8,
-    haystack: &[u8],
-) -> Rev<Memchr3<'_>> {
-    Memchr3::new(needle1, needle2, needle3, haystack).rev()
-}
-
-/// Search for the first occurrence of a byte in a slice.
-///
-/// This returns the index corresponding to the first occurrence of `needle` in
-/// `haystack`, or `None` if one is not found.
-///
-/// While this is operationally the same as something like
-/// `haystack.iter().position(|&b| b == needle)`, `memchr` will use a highly
-/// optimized routine that can be up to an order of magnitude faster in some
-/// cases.
-///
-/// # Example
-///
-/// This shows how to find the first position of a byte in a byte string.
-///
-/// ```
-/// use memchr::memchr;
-///
-/// let haystack = b"the quick brown fox";
-/// assert_eq!(memchr(b'k', haystack), Some(8));
-/// ```
-#[inline]
-pub fn memchr(needle: u8, haystack: &[u8]) -> Option<usize> {
-    #[cfg(miri)]
-    #[inline(always)]
-    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
-        naive::memchr(n1, haystack)
-    }
-
-    #[cfg(all(target_arch = "x86_64", memchr_runtime_simd, not(miri)))]
-    #[inline(always)]
-    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
-        x86::memchr(n1, haystack)
-    }
-
-    #[cfg(all(
-        memchr_libc,
-        not(all(target_arch = "x86_64", memchr_runtime_simd)),
-        not(miri),
-    ))]
-    #[inline(always)]
-    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
-        c::memchr(n1, haystack)
-    }
-
-    #[cfg(all(
-        not(memchr_libc),
-        not(all(target_arch = "x86_64", memchr_runtime_simd)),
-        not(miri),
-    ))]
-    #[inline(always)]
-    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
-        fallback::memchr(n1, haystack)
-    }
-
-    if haystack.is_empty() {
-        None
-    } else {
-        imp(needle, haystack)
-    }
-}
-
-/// Like `memchr`, but searches for either of two bytes instead of just one.
-///
-/// This returns the index corresponding to the first occurrence of `needle1`
-/// or the first occurrence of `needle2` in `haystack` (whichever occurs
-/// earlier), or `None` if neither one is found.
-///
-/// While this is operationally the same as something like
-/// `haystack.iter().position(|&b| b == needle1 || b == needle2)`, `memchr2`
-/// will use a highly optimized routine that can be up to an order of magnitude
-/// faster in some cases.
-///
-/// # Example
-///
-/// This shows how to find the first position of either of two bytes in a byte
-/// string.
-///
-/// ```
-/// use memchr::memchr2;
-///
-/// let haystack = b"the quick brown fox";
-/// assert_eq!(memchr2(b'k', b'q', haystack), Some(4));
-/// ```
-#[inline]
-pub fn memchr2(needle1: u8, needle2: u8, haystack: &[u8]) -> Option<usize> {
-    #[cfg(miri)]
-    #[inline(always)]
-    fn imp(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
-        naive::memchr2(n1, n2, haystack)
-    }
-
-    #[cfg(all(target_arch = "x86_64", memchr_runtime_simd, not(miri)))]
-    #[inline(always)]
-    fn imp(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
-        x86::memchr2(n1, n2, haystack)
-    }
-
-    #[cfg(all(
-        not(all(target_arch = "x86_64", memchr_runtime_simd)),
-        not(miri),
-    ))]
-    #[inline(always)]
-    fn imp(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
-        fallback::memchr2(n1, n2, haystack)
-    }
-
-    if haystack.is_empty() {
-        None
-    } else {
-        imp(needle1, needle2, haystack)
-    }
-}
-
-/// Like `memchr`, but searches for any of three bytes instead of just one.
-///
-/// This returns the index corresponding to the first occurrence of `needle1`,
-/// the first occurrence of `needle2`, or the first occurrence of `needle3` in
-/// `haystack` (whichever occurs earliest), or `None` if none are found.
-///
-/// While this is operationally the same as something like
-/// `haystack.iter().position(|&b| b == needle1 || b == needle2 ||
-/// b == needle3)`, `memchr3` will use a highly optimized routine that can be
-/// up to an order of magnitude faster in some cases.
-///
-/// # Example
-///
-/// This shows how to find the first position of any of three bytes in a byte
-/// string.
-///
-/// ```
-/// use memchr::memchr3;
-///
-/// let haystack = b"the quick brown fox";
-/// assert_eq!(memchr3(b'k', b'q', b'e', haystack), Some(2));
-/// ```
-#[inline]
-pub fn memchr3(
-    needle1: u8,
-    needle2: u8,
-    needle3: u8,
-    haystack: &[u8],
-) -> Option<usize> {
-    #[cfg(miri)]
-    #[inline(always)]
-    fn imp(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
-        naive::memchr3(n1, n2, n3, haystack)
-    }
-
-    #[cfg(all(target_arch = "x86_64", memchr_runtime_simd, not(miri)))]
-    #[inline(always)]
-    fn imp(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
-        x86::memchr3(n1, n2, n3, haystack)
-    }
-
-    #[cfg(all(
-        not(all(target_arch = "x86_64", memchr_runtime_simd)),
-        not(miri),
-    ))]
-    #[inline(always)]
-    fn imp(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
-        fallback::memchr3(n1, n2, n3, haystack)
-    }
-
-    if haystack.is_empty() {
-        None
-    } else {
-        imp(needle1, needle2, needle3, haystack)
-    }
-}
-
-/// Search for the last occurrence of a byte in a slice.
-///
-/// This returns the index corresponding to the last occurrence of `needle` in
-/// `haystack`, or `None` if one is not found.
-///
-/// While this is operationally the same as something like
-/// `haystack.iter().rposition(|&b| b == needle)`, `memrchr` will use a highly
-/// optimized routine that can be up to an order of magnitude faster in some
-/// cases.
-///
-/// # Example
-///
-/// This shows how to find the last position of a byte in a byte string.
-///
-/// ```
-/// use memchr::memrchr;
-///
-/// let haystack = b"the quick brown fox";
-/// assert_eq!(memrchr(b'o', haystack), Some(17));
-/// ```
-#[inline]
-pub fn memrchr(needle: u8, haystack: &[u8]) -> Option<usize> {
-    #[cfg(miri)]
-    #[inline(always)]
-    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
-        naive::memrchr(n1, haystack)
-    }
-
-    #[cfg(all(target_arch = "x86_64", memchr_runtime_simd, not(miri)))]
-    #[inline(always)]
-    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
-        x86::memrchr(n1, haystack)
-    }
-
-    #[cfg(all(
-        memchr_libc,
-        target_os = "linux",
-        not(all(target_arch = "x86_64", memchr_runtime_simd)),
-        not(miri)
-    ))]
-    #[inline(always)]
-    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
-        c::memrchr(n1, haystack)
-    }
-
-    #[cfg(all(
-        not(all(memchr_libc, target_os = "linux")),
-        not(all(target_arch = "x86_64", memchr_runtime_simd)),
-        not(miri),
-    ))]
-    #[inline(always)]
-    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
-        fallback::memrchr(n1, haystack)
-    }
-
-    if haystack.is_empty() {
-        None
-    } else {
-        imp(needle, haystack)
-    }
-}
-
-/// Like `memrchr`, but searches for either of two bytes instead of just one.
-///
-/// This returns the index corresponding to the last occurrence of `needle1`
-/// or the last occurrence of `needle2` in `haystack` (whichever occurs later),
-/// or `None` if neither one is found.
-///
-/// While this is operationally the same as something like
-/// `haystack.iter().rposition(|&b| b == needle1 || b == needle2)`, `memrchr2`
-/// will use a highly optimized routine that can be up to an order of magnitude
-/// faster in some cases.
-///
-/// # Example
-///
-/// This shows how to find the last position of either of two bytes in a byte
-/// string.
-///
-/// ```
-/// use memchr::memrchr2;
-///
-/// let haystack = b"the quick brown fox";
-/// assert_eq!(memrchr2(b'k', b'q', haystack), Some(8));
-/// ```
-#[inline]
-pub fn memrchr2(needle1: u8, needle2: u8, haystack: &[u8]) -> Option<usize> {
-    #[cfg(miri)]
-    #[inline(always)]
-    fn imp(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
-        naive::memrchr2(n1, n2, haystack)
-    }
-
-    #[cfg(all(target_arch = "x86_64", memchr_runtime_simd, not(miri)))]
-    #[inline(always)]
-    fn imp(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
-        x86::memrchr2(n1, n2, haystack)
-    }
-
-    #[cfg(all(
-        not(all(target_arch = "x86_64", memchr_runtime_simd)),
-        not(miri),
-    ))]
-    #[inline(always)]
-    fn imp(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
-        fallback::memrchr2(n1, n2, haystack)
-    }
-
-    if haystack.is_empty() {
-        None
-    } else {
-        imp(needle1, needle2, haystack)
-    }
-}
-
-/// Like `memrchr`, but searches for any of three bytes instead of just one.
-///
-/// This returns the index corresponding to the last occurrence of `needle1`,
-/// the last occurrence of `needle2`, or the last occurrence of `needle3` in
-/// `haystack` (whichever occurs later), or `None` if none are found.
-///
-/// While this is operationally the same as something like
-/// `haystack.iter().rposition(|&b| b == needle1 || b == needle2 ||
-/// b == needle3)`, `memrchr3` will use a highly optimized routine that can be
-/// up to an order of magnitude faster in some cases.
-///
-/// # Example
-///
-/// This shows how to find the last position of any of three bytes in a byte
-/// string.
-///
-/// ```
-/// use memchr::memrchr3;
-///
-/// let haystack = b"the quick brown fox";
-/// assert_eq!(memrchr3(b'k', b'q', b'e', haystack), Some(8));
-/// ```
-#[inline]
-pub fn memrchr3(
-    needle1: u8,
-    needle2: u8,
-    needle3: u8,
-    haystack: &[u8],
-) -> Option<usize> {
-    #[cfg(miri)]
-    #[inline(always)]
-    fn imp(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
-        naive::memrchr3(n1, n2, n3, haystack)
-    }
-
-    #[cfg(all(target_arch = "x86_64", memchr_runtime_simd, not(miri)))]
-    #[inline(always)]
-    fn imp(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
-        x86::memrchr3(n1, n2, n3, haystack)
-    }
-
-    #[cfg(all(
-        not(all(target_arch = "x86_64", memchr_runtime_simd)),
-        not(miri),
-    ))]
-    #[inline(always)]
-    fn imp(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
-        fallback::memrchr3(n1, n2, n3, haystack)
-    }
-
-    if haystack.is_empty() {
-        None
-    } else {
-        imp(needle1, needle2, needle3, haystack)
-    }
-}
@@ -6,6 +6,7 @@
 use libc::{c_int, c_void, size_t};

 pub fn memchr(needle: u8, haystack: &[u8]) -> Option<usize> {
+    // SAFETY: This is safe to call since all pointers are valid.
    let p = unsafe {
        libc::memchr(
            haystack.as_ptr() as *const c_void,
@@ -27,6 +28,7 @@ pub fn memrchr(needle: u8, haystack: &[u8]) -> Option<usize> {
    if haystack.is_empty() {
        return None;
    }
+    // SAFETY: This is safe to call since all pointers are valid.
    let p = unsafe {
        libc::memrchr(
            haystack.as_ptr() as *const c_void,
@@ -49,10 +49,10 @@ pub fn memchr(n1: u8, haystack: &[u8]) -> Option<usize> {
    let loop_size = cmp::min(LOOP_SIZE, haystack.len());
    let align = USIZE_BYTES - 1;
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
    let mut ptr = start_ptr;

    unsafe {
+        let end_ptr = start_ptr.add(haystack.len());
        if haystack.len() < USIZE_BYTES {
            return forward_search(start_ptr, end_ptr, ptr, confirm);
        }
@@ -88,10 +88,10 @@ pub fn memchr2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
    let confirm = |byte| byte == n1 || byte == n2;
    let align = USIZE_BYTES - 1;
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
    let mut ptr = start_ptr;

    unsafe {
+        let end_ptr = start_ptr.add(haystack.len());
        if haystack.len() < USIZE_BYTES {
            return forward_search(start_ptr, end_ptr, ptr, confirm);
        }
@@ -129,10 +129,10 @@ pub fn memchr3(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
    let confirm = |byte| byte == n1 || byte == n2 || byte == n3;
    let align = USIZE_BYTES - 1;
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
    let mut ptr = start_ptr;

    unsafe {
+        let end_ptr = start_ptr.add(haystack.len());
        if haystack.len() < USIZE_BYTES {
            return forward_search(start_ptr, end_ptr, ptr, confirm);
        }
@@ -171,10 +171,10 @@ pub fn memrchr(n1: u8, haystack: &[u8]) -> Option<usize> {
    let loop_size = cmp::min(LOOP_SIZE, haystack.len());
    let align = USIZE_BYTES - 1;
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
-    let mut ptr = end_ptr;

    unsafe {
+        let end_ptr = start_ptr.add(haystack.len());
+        let mut ptr = end_ptr;
        if haystack.len() < USIZE_BYTES {
            return reverse_search(start_ptr, end_ptr, ptr, confirm);
        }
@@ -209,10 +209,10 @@ pub fn memrchr2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
    let confirm = |byte| byte == n1 || byte == n2;
    let align = USIZE_BYTES - 1;
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
-    let mut ptr = end_ptr;

    unsafe {
+        let end_ptr = start_ptr.add(haystack.len());
+        let mut ptr = end_ptr;
        if haystack.len() < USIZE_BYTES {
            return reverse_search(start_ptr, end_ptr, ptr, confirm);
        }
@@ -249,10 +249,10 @@ pub fn memrchr3(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
    let confirm = |byte| byte == n1 || byte == n2 || byte == n3;
    let align = USIZE_BYTES - 1;
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
-    let mut ptr = end_ptr;

    unsafe {
+        let end_ptr = start_ptr.add(haystack.len());
+        let mut ptr = end_ptr;
        if haystack.len() < USIZE_BYTES {
            return reverse_search(start_ptr, end_ptr, ptr, confirm);
        }
@@ -0,0 +1,410 @@
+use core::iter::Rev;
+
+pub use self::iter::{Memchr, Memchr2, Memchr3};
+
+// N.B. If you're looking for the cfg knobs for libc, see build.rs.
+#[cfg(memchr_libc)]
+mod c;
+#[allow(dead_code)]
+pub mod fallback;
+mod iter;
+pub mod naive;
+#[cfg(all(not(miri), target_arch = "x86_64", memchr_runtime_simd))]
+mod x86;
+
+/// An iterator over all occurrences of the needle in a haystack.
+#[inline]
+pub fn memchr_iter(needle: u8, haystack: &[u8]) -> Memchr<'_> {
+    Memchr::new(needle, haystack)
+}
+
+/// An iterator over all occurrences of the needles in a haystack.
+#[inline]
+pub fn memchr2_iter(needle1: u8, needle2: u8, haystack: &[u8]) -> Memchr2<'_> {
+    Memchr2::new(needle1, needle2, haystack)
+}
+
+/// An iterator over all occurrences of the needles in a haystack.
+#[inline]
+pub fn memchr3_iter(
+    needle1: u8,
+    needle2: u8,
+    needle3: u8,
+    haystack: &[u8],
+) -> Memchr3<'_> {
+    Memchr3::new(needle1, needle2, needle3, haystack)
+}
+
+/// An iterator over all occurrences of the needle in a haystack, in reverse.
+#[inline]
+pub fn memrchr_iter(needle: u8, haystack: &[u8]) -> Rev<Memchr<'_>> {
+    Memchr::new(needle, haystack).rev()
+}
+
+/// An iterator over all occurrences of the needles in a haystack, in reverse.
+#[inline]
+pub fn memrchr2_iter(
+    needle1: u8,
+    needle2: u8,
+    haystack: &[u8],
+) -> Rev<Memchr2<'_>> {
+    Memchr2::new(needle1, needle2, haystack).rev()
+}
+
+/// An iterator over all occurrences of the needles in a haystack, in reverse.
+#[inline]
+pub fn memrchr3_iter(
+    needle1: u8,
+    needle2: u8,
+    needle3: u8,
+    haystack: &[u8],
+) -> Rev<Memchr3<'_>> {
+    Memchr3::new(needle1, needle2, needle3, haystack).rev()
+}
+
+/// Search for the first occurrence of a byte in a slice.
+///
+/// This returns the index corresponding to the first occurrence of `needle` in
+/// `haystack`, or `None` if one is not found. If an index is returned, it is
+/// guaranteed to be less than `usize::MAX`.
+///
+/// While this is operationally the same as something like
+/// `haystack.iter().position(|&b| b == needle)`, `memchr` will use a highly
+/// optimized routine that can be up to an order of magnitude faster in some
+/// cases.
+///
+/// # Example
+///
+/// This shows how to find the first position of a byte in a byte string.
+///
+/// ```
+/// use memchr::memchr;
+///
+/// let haystack = b"the quick brown fox";
+/// assert_eq!(memchr(b'k', haystack), Some(8));
+/// ```
+#[inline]
+pub fn memchr(needle: u8, haystack: &[u8]) -> Option<usize> {
+    #[cfg(miri)]
+    #[inline(always)]
+    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
+        naive::memchr(n1, haystack)
+    }
+
+    #[cfg(all(target_arch = "x86_64", memchr_runtime_simd, not(miri)))]
+    #[inline(always)]
+    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
+        x86::memchr(n1, haystack)
+    }
+
+    #[cfg(all(
+        memchr_libc,
+        not(all(target_arch = "x86_64", memchr_runtime_simd)),
+        not(miri),
+    ))]
+    #[inline(always)]
+    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
+        c::memchr(n1, haystack)
+    }
+
+    #[cfg(all(
+        not(memchr_libc),
+        not(all(target_arch = "x86_64", memchr_runtime_simd)),
+        not(miri),
+    ))]
+    #[inline(always)]
+    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
+        fallback::memchr(n1, haystack)
+    }
+
+    if haystack.is_empty() {
+        None
+    } else {
+        imp(needle, haystack)
+    }
+}
+
+/// Like `memchr`, but searches for either of two bytes instead of just one.
+///
+/// This returns the index corresponding to the first occurrence of `needle1`
+/// or the first occurrence of `needle2` in `haystack` (whichever occurs
+/// earlier), or `None` if neither one is found. If an index is returned, it is
+/// guaranteed to be less than `usize::MAX`.
+///
+/// While this is operationally the same as something like
+/// `haystack.iter().position(|&b| b == needle1 || b == needle2)`, `memchr2`
+/// will use a highly optimized routine that can be up to an order of magnitude
+/// faster in some cases.
+///
+/// # Example
+///
+/// This shows how to find the first position of either of two bytes in a byte
+/// string.
+///
+/// ```
+/// use memchr::memchr2;
+///
+/// let haystack = b"the quick brown fox";
+/// assert_eq!(memchr2(b'k', b'q', haystack), Some(4));
+/// ```
+#[inline]
+pub fn memchr2(needle1: u8, needle2: u8, haystack: &[u8]) -> Option<usize> {
+    #[cfg(miri)]
+    #[inline(always)]
+    fn imp(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
+        naive::memchr2(n1, n2, haystack)
+    }
+
+    #[cfg(all(target_arch = "x86_64", memchr_runtime_simd, not(miri)))]
+    #[inline(always)]
+    fn imp(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
+        x86::memchr2(n1, n2, haystack)
+    }
+
+    #[cfg(all(
+        not(all(target_arch = "x86_64", memchr_runtime_simd)),
+        not(miri),
+    ))]
+    #[inline(always)]
+    fn imp(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
+        fallback::memchr2(n1, n2, haystack)
+    }
+
+    if haystack.is_empty() {
+        None
+    } else {
+        imp(needle1, needle2, haystack)
+    }
+}
+
+/// Like `memchr`, but searches for any of three bytes instead of just one.
+///
+/// This returns the index corresponding to the first occurrence of `needle1`,
+/// the first occurrence of `needle2`, or the first occurrence of `needle3` in
+/// `haystack` (whichever occurs earliest), or `None` if none are found. If an
+/// index is returned, it is guaranteed to be less than `usize::MAX`.
+///
+/// While this is operationally the same as something like
+/// `haystack.iter().position(|&b| b == needle1 || b == needle2 ||
+/// b == needle3)`, `memchr3` will use a highly optimized routine that can be
+/// up to an order of magnitude faster in some cases.
+///
+/// # Example
+///
+/// This shows how to find the first position of any of three bytes in a byte
+/// string.
+///
+/// ```
+/// use memchr::memchr3;
+///
+/// let haystack = b"the quick brown fox";
+/// assert_eq!(memchr3(b'k', b'q', b'e', haystack), Some(2));
+/// ```
+#[inline]
+pub fn memchr3(
+    needle1: u8,
+    needle2: u8,
+    needle3: u8,
+    haystack: &[u8],
+) -> Option<usize> {
+    #[cfg(miri)]
+    #[inline(always)]
+    fn imp(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
+        naive::memchr3(n1, n2, n3, haystack)
+    }
+
+    #[cfg(all(target_arch = "x86_64", memchr_runtime_simd, not(miri)))]
+    #[inline(always)]
+    fn imp(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
+        x86::memchr3(n1, n2, n3, haystack)
+    }
+
+    #[cfg(all(
+        not(all(target_arch = "x86_64", memchr_runtime_simd)),
+        not(miri),
+    ))]
+    #[inline(always)]
+    fn imp(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
+        fallback::memchr3(n1, n2, n3, haystack)
+    }
+
+    if haystack.is_empty() {
+        None
+    } else {
+        imp(needle1, needle2, needle3, haystack)
+    }
+}
+
+/// Search for the last occurrence of a byte in a slice.
+///
+/// This returns the index corresponding to the last occurrence of `needle` in
+/// `haystack`, or `None` if one is not found. If an index is returned, it is
+/// guaranteed to be less than `usize::MAX`.
+///
+/// While this is operationally the same as something like
+/// `haystack.iter().rposition(|&b| b == needle)`, `memrchr` will use a highly
+/// optimized routine that can be up to an order of magnitude faster in some
+/// cases.
+///
+/// # Example
+///
+/// This shows how to find the last position of a byte in a byte string.
+///
+/// ```
+/// use memchr::memrchr;
+///
+/// let haystack = b"the quick brown fox";
+/// assert_eq!(memrchr(b'o', haystack), Some(17));
+/// ```
+#[inline]
+pub fn memrchr(needle: u8, haystack: &[u8]) -> Option<usize> {
+    #[cfg(miri)]
+    #[inline(always)]
+    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
+        naive::memrchr(n1, haystack)
+    }
+
+    #[cfg(all(target_arch = "x86_64", memchr_runtime_simd, not(miri)))]
+    #[inline(always)]
+    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
+        x86::memrchr(n1, haystack)
+    }
+
+    #[cfg(all(
+        memchr_libc,
+        target_os = "linux",
+        not(all(target_arch = "x86_64", memchr_runtime_simd)),
+        not(miri)
+    ))]
+    #[inline(always)]
+    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
+        c::memrchr(n1, haystack)
+    }
+
+    #[cfg(all(
+        not(all(memchr_libc, target_os = "linux")),
+        not(all(target_arch = "x86_64", memchr_runtime_simd)),
+        not(miri),
+    ))]
+    #[inline(always)]
+    fn imp(n1: u8, haystack: &[u8]) -> Option<usize> {
+        fallback::memrchr(n1, haystack)
+    }
+
+    if haystack.is_empty() {
+        None
+    } else {
+        imp(needle, haystack)
+    }
+}
+
+/// Like `memrchr`, but searches for either of two bytes instead of just one.
+///
+/// This returns the index corresponding to the last occurrence of `needle1` or
+/// the last occurrence of `needle2` in `haystack` (whichever occurs later), or
+/// `None` if neither one is found. If an index is returned, it is guaranteed
+/// to be less than `usize::MAX`.
+///
+/// While this is operationally the same as something like
+/// `haystack.iter().rposition(|&b| b == needle1 || b == needle2)`, `memrchr2`
+/// will use a highly optimized routine that can be up to an order of magnitude
+/// faster in some cases.
+///
+/// # Example
+///
+/// This shows how to find the last position of either of two bytes in a byte
+/// string.
+///
+/// ```
+/// use memchr::memrchr2;
+///
+/// let haystack = b"the quick brown fox";
+/// assert_eq!(memrchr2(b'k', b'q', haystack), Some(8));
+/// ```
+#[inline]
+pub fn memrchr2(needle1: u8, needle2: u8, haystack: &[u8]) -> Option<usize> {
+    #[cfg(miri)]
+    #[inline(always)]
+    fn imp(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
+        naive::memrchr2(n1, n2, haystack)
+    }
+
+    #[cfg(all(target_arch = "x86_64", memchr_runtime_simd, not(miri)))]
+    #[inline(always)]
+    fn imp(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
+        x86::memrchr2(n1, n2, haystack)
+    }
+
+    #[cfg(all(
+        not(all(target_arch = "x86_64", memchr_runtime_simd)),
+        not(miri),
+    ))]
+    #[inline(always)]
+    fn imp(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
+        fallback::memrchr2(n1, n2, haystack)
+    }
+
+    if haystack.is_empty() {
+        None
+    } else {
+        imp(needle1, needle2, haystack)
+    }
+}
+
+/// Like `memrchr`, but searches for any of three bytes instead of just one.
+///
+/// This returns the index corresponding to the last occurrence of `needle1`,
+/// the last occurrence of `needle2`, or the last occurrence of `needle3` in
+/// `haystack` (whichever occurs latest), or `None` if none are found. If an
+/// index is returned, it is guaranteed to be less than `usize::MAX`.
+///
+/// While this is operationally the same as something like
+/// `haystack.iter().rposition(|&b| b == needle1 || b == needle2 ||
+/// b == needle3)`, `memrchr3` will use a highly optimized routine that can be
+/// up to an order of magnitude faster in some cases.
+///
+/// # Example
+///
+/// This shows how to find the last position of any of three bytes in a byte
+/// string.
+///
+/// ```
+/// use memchr::memrchr3;
+///
+/// let haystack = b"the quick brown fox";
+/// assert_eq!(memrchr3(b'k', b'q', b'e', haystack), Some(8));
+/// ```
+#[inline]
+pub fn memrchr3(
+    needle1: u8,
+    needle2: u8,
+    needle3: u8,
+    haystack: &[u8],
+) -> Option<usize> {
+    #[cfg(miri)]
+    #[inline(always)]
+    fn imp(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
+        naive::memrchr3(n1, n2, n3, haystack)
+    }
+
+    #[cfg(all(target_arch = "x86_64", memchr_runtime_simd, not(miri)))]
+    #[inline(always)]
+    fn imp(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
+        x86::memrchr3(n1, n2, n3, haystack)
+    }
+
+    #[cfg(all(
+        not(all(target_arch = "x86_64", memchr_runtime_simd)),
+        not(miri),
+    ))]
+    #[inline(always)]
+    fn imp(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
+        fallback::memrchr3(n1, n2, n3, haystack)
+    }
+
+    if haystack.is_empty() {
+        None
+    } else {
+        imp(needle1, needle2, needle3, haystack)
+    }
+}
@@ -1,6 +1,6 @@
 use core::{arch::x86_64::*, cmp, mem::size_of};

-use crate::x86::sse2;
+use super::sse2;

 const VECTOR_SIZE: usize = size_of::<__m256i>();
 const VECTOR_ALIGN: usize = VECTOR_SIZE - 1;
@@ -20,8 +20,50 @@ pub unsafe fn memchr(n1: u8, haystack: &[u8]) -> Option<usize> {
    // sse2 implementation. The avx implementation here is the same, but with
    // 256-bit vectors instead of 128-bit vectors.

+    // This routine is called whenever a match is detected. It is specifically
+    // marked as uninlineable because it improves the codegen of the unrolled
+    // loop below. Inlining this seems to cause codegen with some extra adds
+    // and a load that aren't necessary. This seems to result in about a 10%
+    // improvement for the memchr1/crate/huge/never benchmark.
+    //
+    // Interestingly, I couldn't observe a similar improvement for memrchr.
+    #[cold]
+    #[inline(never)]
+    #[target_feature(enable = "avx2")]
+    unsafe fn matched(
+        start_ptr: *const u8,
+        ptr: *const u8,
+        eqa: __m256i,
+        eqb: __m256i,
+        eqc: __m256i,
+        eqd: __m256i,
+    ) -> usize {
+        let mut at = sub(ptr, start_ptr);
+        let mask = _mm256_movemask_epi8(eqa);
+        if mask != 0 {
+            return at + forward_pos(mask);
+        }
+
+        at += VECTOR_SIZE;
+        let mask = _mm256_movemask_epi8(eqb);
+        if mask != 0 {
+            return at + forward_pos(mask);
+        }
+
+        at += VECTOR_SIZE;
+        let mask = _mm256_movemask_epi8(eqc);
+        if mask != 0 {
+            return at + forward_pos(mask);
+        }
+
+        at += VECTOR_SIZE;
+        let mask = _mm256_movemask_epi8(eqd);
+        debug_assert!(mask != 0);
+        at + forward_pos(mask)
+    }
+
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
    let mut ptr = start_ptr;

    if haystack.len() < VECTOR_SIZE {
@@ -52,29 +94,9 @@ pub unsafe fn memchr(n1: u8, haystack: &[u8]) -> Option<usize> {
        let or1 = _mm256_or_si256(eqa, eqb);
        let or2 = _mm256_or_si256(eqc, eqd);
        let or3 = _mm256_or_si256(or1, or2);
+
        if _mm256_movemask_epi8(or3) != 0 {
-            let mut at = sub(ptr, start_ptr);
-            let mask = _mm256_movemask_epi8(eqa);
-            if mask != 0 {
-                return Some(at + forward_pos(mask));
-            }
-
-            at += VECTOR_SIZE;
-            let mask = _mm256_movemask_epi8(eqb);
-            if mask != 0 {
-                return Some(at + forward_pos(mask));
-            }
-
-            at += VECTOR_SIZE;
-            let mask = _mm256_movemask_epi8(eqc);
-            if mask != 0 {
-                return Some(at + forward_pos(mask));
-            }
-
-            at += VECTOR_SIZE;
-            let mask = _mm256_movemask_epi8(eqd);
-            debug_assert!(mask != 0);
-            return Some(at + forward_pos(mask));
+            return Some(matched(start_ptr, ptr, eqa, eqb, eqc, eqd));
        }
        ptr = ptr.add(loop_size);
    }
@@ -98,12 +120,36 @@ pub unsafe fn memchr(n1: u8, haystack: &[u8]) -> Option<usize> {

 #[target_feature(enable = "avx2")]
 pub unsafe fn memchr2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
+    #[cold]
+    #[inline(never)]
+    #[target_feature(enable = "avx2")]
+    unsafe fn matched(
+        start_ptr: *const u8,
+        ptr: *const u8,
+        eqa1: __m256i,
+        eqa2: __m256i,
+        eqb1: __m256i,
+        eqb2: __m256i,
+    ) -> usize {
+        let mut at = sub(ptr, start_ptr);
+        let mask1 = _mm256_movemask_epi8(eqa1);
+        let mask2 = _mm256_movemask_epi8(eqa2);
+        if mask1 != 0 || mask2 != 0 {
+            return at + forward_pos2(mask1, mask2);
+        }
+
+        at += VECTOR_SIZE;
+        let mask1 = _mm256_movemask_epi8(eqb1);
+        let mask2 = _mm256_movemask_epi8(eqb2);
+        at + forward_pos2(mask1, mask2)
+    }
+
    let vn1 = _mm256_set1_epi8(n1 as i8);
    let vn2 = _mm256_set1_epi8(n2 as i8);
    let len = haystack.len();
    let loop_size = cmp::min(LOOP_SIZE2, len);
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
    let mut ptr = start_ptr;

    if haystack.len() < VECTOR_SIZE {
@@ -135,17 +181,7 @@ pub unsafe fn memchr2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
        let or2 = _mm256_or_si256(eqa2, eqb2);
        let or3 = _mm256_or_si256(or1, or2);
        if _mm256_movemask_epi8(or3) != 0 {
-            let mut at = sub(ptr, start_ptr);
-            let mask1 = _mm256_movemask_epi8(eqa1);
-            let mask2 = _mm256_movemask_epi8(eqa2);
-            if mask1 != 0 || mask2 != 0 {
-                return Some(at + forward_pos2(mask1, mask2));
-            }
-
-            at += VECTOR_SIZE;
-            let mask1 = _mm256_movemask_epi8(eqb1);
-            let mask2 = _mm256_movemask_epi8(eqb2);
-            return Some(at + forward_pos2(mask1, mask2));
+            return Some(matched(start_ptr, ptr, eqa1, eqa2, eqb1, eqb2));
        }
        ptr = ptr.add(loop_size);
    }
@@ -172,13 +208,41 @@ pub unsafe fn memchr3(
    n3: u8,
    haystack: &[u8],
 ) -> Option<usize> {
+    #[cold]
+    #[inline(never)]
+    #[target_feature(enable = "avx2")]
+    unsafe fn matched(
+        start_ptr: *const u8,
+        ptr: *const u8,
+        eqa1: __m256i,
+        eqa2: __m256i,
+        eqa3: __m256i,
+        eqb1: __m256i,
+        eqb2: __m256i,
+        eqb3: __m256i,
+    ) -> usize {
+        let mut at = sub(ptr, start_ptr);
+        let mask1 = _mm256_movemask_epi8(eqa1);
+        let mask2 = _mm256_movemask_epi8(eqa2);
+        let mask3 = _mm256_movemask_epi8(eqa3);
+        if mask1 != 0 || mask2 != 0 || mask3 != 0 {
+            return at + forward_pos3(mask1, mask2, mask3);
+        }
+
+        at += VECTOR_SIZE;
+        let mask1 = _mm256_movemask_epi8(eqb1);
+        let mask2 = _mm256_movemask_epi8(eqb2);
+        let mask3 = _mm256_movemask_epi8(eqb3);
+        at + forward_pos3(mask1, mask2, mask3)
+    }
+
    let vn1 = _mm256_set1_epi8(n1 as i8);
    let vn2 = _mm256_set1_epi8(n2 as i8);
    let vn3 = _mm256_set1_epi8(n3 as i8);
    let len = haystack.len();
    let loop_size = cmp::min(LOOP_SIZE2, len);
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
    let mut ptr = start_ptr;

    if haystack.len() < VECTOR_SIZE {
@@ -214,19 +278,9 @@ pub unsafe fn memchr3(
        let or4 = _mm256_or_si256(or1, or2);
        let or5 = _mm256_or_si256(or3, or4);
        if _mm256_movemask_epi8(or5) != 0 {
-            let mut at = sub(ptr, start_ptr);
-            let mask1 = _mm256_movemask_epi8(eqa1);
-            let mask2 = _mm256_movemask_epi8(eqa2);
-            let mask3 = _mm256_movemask_epi8(eqa3);
-            if mask1 != 0 || mask2 != 0 || mask3 != 0 {
-                return Some(at + forward_pos3(mask1, mask2, mask3));
-            }
-
-            at += VECTOR_SIZE;
-            let mask1 = _mm256_movemask_epi8(eqb1);
-            let mask2 = _mm256_movemask_epi8(eqb2);
-            let mask3 = _mm256_movemask_epi8(eqb3);
-            return Some(at + forward_pos3(mask1, mask2, mask3));
+            return Some(matched(
+                start_ptr, ptr, eqa1, eqa2, eqa3, eqb1, eqb2, eqb3,
+            ));
        }
        ptr = ptr.add(loop_size);
    }
@@ -254,7 +308,7 @@ pub unsafe fn memrchr(n1: u8, haystack: &[u8]) -> Option<usize> {
    let len = haystack.len();
    let loop_size = cmp::min(LOOP_SIZE, len);
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
    let mut ptr = end_ptr;

    if haystack.len() < VECTOR_SIZE {
@@ -334,7 +388,7 @@ pub unsafe fn memrchr2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
    let len = haystack.len();
    let loop_size = cmp::min(LOOP_SIZE2, len);
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
    let mut ptr = end_ptr;

    if haystack.len() < VECTOR_SIZE {
@@ -407,7 +461,7 @@ pub unsafe fn memrchr3(
    let len = haystack.len();
    let loop_size = cmp::min(LOOP_SIZE2, len);
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
    let mut ptr = end_ptr;

    if haystack.len() < VECTOR_SIZE {
@@ -0,0 +1,148 @@
+use super::fallback;
+
+// We only use AVX when we can detect at runtime whether it's available, which
+// requires std.
+#[cfg(feature = "std")]
+mod avx;
+mod sse2;
+
+/// This macro employs a gcc-like "ifunc" trick whereby upon first calling
+/// `memchr` (for example), CPU feature detection will be performed at runtime
+/// to determine the best implementation to use. After CPU feature detection
+/// is done, we replace `memchr`'s function pointer with the selection. Upon
+/// subsequent invocations, the CPU-specific routine is invoked directly, which
+/// skips the CPU feature detection and subsequent branch that's required.
+///
+/// While this typically doesn't matter for rare occurrences or when used on
+/// larger haystacks, `memchr` can be called in tight loops where the overhead
+/// of this branch can actually add up *and is measurable*. This trick was
+/// necessary to bring this implementation up to glibc's speeds for the 'tiny'
+/// benchmarks, for example.
+///
+/// At some point, I expect the Rust ecosystem will get a nice macro for doing
+/// exactly this, at which point, we can replace our hand-jammed version of it.
+///
+/// N.B. The ifunc strategy does prevent function inlining of course, but
+/// on modern CPUs, you'll probably end up with the AVX2 implementation,
+/// which probably can't be inlined anyway---unless you've compiled your
+/// entire program with AVX2 enabled. However, even then, the various memchr
+/// implementations aren't exactly small, so inlining might not help anyway!
+///
+/// # Safety
+///
+/// Callers must ensure that `$fnty` is a function pointer type.
+#[cfg(feature = "std")]
+macro_rules! unsafe_ifunc {
+    ($fnty:ty, $name:ident, $haystack:ident, $($needle:ident),+) => {{
+        use std::{mem, sync::atomic::{AtomicPtr, Ordering}};
+
+        type FnRaw = *mut ();
+
+        static FN: AtomicPtr<()> = AtomicPtr::new(detect as FnRaw);
+
+        fn detect($($needle: u8),+, haystack: &[u8]) -> Option<usize> {
+            let fun =
+                if cfg!(memchr_runtime_avx) && is_x86_feature_detected!("avx2") {
+                    avx::$name as FnRaw
+                } else if cfg!(memchr_runtime_sse2) {
+                    sse2::$name as FnRaw
+                } else {
+                    fallback::$name as FnRaw
+                };
+            FN.store(fun as FnRaw, Ordering::Relaxed);
+            // SAFETY: By virtue of the caller contract, $fnty is a function
+            // pointer, which is always safe to transmute with a *mut ().
+            // Also, if `fun` is the AVX routine, then it is guaranteed to be
+            // supported since we checked the avx2 feature.
+            unsafe {
+                mem::transmute::<FnRaw, $fnty>(fun)($($needle),+, haystack)
+            }
+        }
+
+        // SAFETY: By virtue of the caller contract, $fnty is a function
+        // pointer, which is always safe to transmute with a *mut (). Also, if
+        // `fun` is the AVX routine, then it is guaranteed to be supported since
+        // we checked the avx2 feature.
+        unsafe {
+            let fun = FN.load(Ordering::Relaxed);
+            mem::transmute::<FnRaw, $fnty>(fun)($($needle),+, $haystack)
+        }
+    }}
+}
+
+/// When std isn't available to provide runtime CPU feature detection, or if
+/// runtime CPU feature detection has been explicitly disabled, then just
+/// call our optimized SSE2 routine directly. SSE2 is available on all x86_64
+/// targets, so no CPU feature detection is necessary.
+///
+/// # Safety
+///
+/// There are no safety requirements for this definition of the macro. It is
+/// safe for all inputs since it is restricted to either the fallback routine
+/// or the SSE routine, which is always safe to call on x86_64.
+#[cfg(not(feature = "std"))]
+macro_rules! unsafe_ifunc {
+    ($fnty:ty, $name:ident, $haystack:ident, $($needle:ident),+) => {{
+        if cfg!(memchr_runtime_sse2) {
+            unsafe { sse2::$name($($needle),+, $haystack) }
+        } else {
+            fallback::$name($($needle),+, $haystack)
+        }
+    }}
+}
+
+#[inline(always)]
+pub fn memchr(n1: u8, haystack: &[u8]) -> Option<usize> {
+    unsafe_ifunc!(fn(u8, &[u8]) -> Option<usize>, memchr, haystack, n1)
+}
+
+#[inline(always)]
+pub fn memchr2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
+    unsafe_ifunc!(
+        fn(u8, u8, &[u8]) -> Option<usize>,
+        memchr2,
+        haystack,
+        n1,
+        n2
+    )
+}
+
+#[inline(always)]
+pub fn memchr3(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
+    unsafe_ifunc!(
+        fn(u8, u8, u8, &[u8]) -> Option<usize>,
+        memchr3,
+        haystack,
+        n1,
+        n2,
+        n3
+    )
+}
+
+#[inline(always)]
+pub fn memrchr(n1: u8, haystack: &[u8]) -> Option<usize> {
+    unsafe_ifunc!(fn(u8, &[u8]) -> Option<usize>, memrchr, haystack, n1)
+}
+
+#[inline(always)]
+pub fn memrchr2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
+    unsafe_ifunc!(
+        fn(u8, u8, &[u8]) -> Option<usize>,
+        memrchr2,
+        haystack,
+        n1,
+        n2
+    )
+}
+
+#[inline(always)]
+pub fn memrchr3(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
+    unsafe_ifunc!(
+        fn(u8, u8, u8, &[u8]) -> Option<usize>,
+        memrchr3,
+        haystack,
+        n1,
+        n2,
+        n3
+    )
+}
@@ -109,7 +109,7 @@ pub unsafe fn memchr(n1: u8, haystack: &[u8]) -> Option<usize> {
    let len = haystack.len();
    let loop_size = cmp::min(LOOP_SIZE, len);
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
    let mut ptr = start_ptr;

    if haystack.len() < VECTOR_SIZE {
@@ -193,7 +193,7 @@ pub unsafe fn memchr2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
    let len = haystack.len();
    let loop_size = cmp::min(LOOP_SIZE2, len);
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
    let mut ptr = start_ptr;

    if haystack.len() < VECTOR_SIZE {
@@ -268,7 +268,7 @@ pub unsafe fn memchr3(
    let len = haystack.len();
    let loop_size = cmp::min(LOOP_SIZE2, len);
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
    let mut ptr = start_ptr;

    if haystack.len() < VECTOR_SIZE {
@@ -344,7 +344,7 @@ pub unsafe fn memrchr(n1: u8, haystack: &[u8]) -> Option<usize> {
    let len = haystack.len();
    let loop_size = cmp::min(LOOP_SIZE, len);
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
    let mut ptr = end_ptr;

    if haystack.len() < VECTOR_SIZE {
@@ -424,7 +424,7 @@ pub unsafe fn memrchr2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
    let len = haystack.len();
    let loop_size = cmp::min(LOOP_SIZE2, len);
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
    let mut ptr = end_ptr;

    if haystack.len() < VECTOR_SIZE {
@@ -497,7 +497,7 @@ pub unsafe fn memrchr3(
    let len = haystack.len();
    let loop_size = cmp::min(LOOP_SIZE2, len);
    let start_ptr = haystack.as_ptr();
-    let end_ptr = haystack[haystack.len()..].as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
    let mut ptr = end_ptr;

    if haystack.len() < VECTOR_SIZE {
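The six hunks above all make the same change: the end pointer is computed with `start_ptr.add(haystack.len())` instead of slicing an empty tail and calling `as_ptr()` on it. A minimal sketch of the two forms follows; the helper name `end_ptrs` is hypothetical and not part of the crate. It only illustrates that both expressions yield the same one-past-the-end address.

```rust
/// Hypothetical helper (not part of the crate): computes the
/// one-past-the-end pointer of a slice in the two ways shown in the diff.
fn end_ptrs(haystack: &[u8]) -> (*const u8, *const u8) {
    let start_ptr = haystack.as_ptr();
    // Old form: take an empty tail slice and ask for its pointer.
    let via_slice = haystack[haystack.len()..].as_ptr();
    // New form: offset the start pointer directly.
    // SAFETY: `start_ptr + len` is exactly one past the end of the slice,
    // which is a permitted offset for `add`.
    let via_add = unsafe { start_ptr.add(haystack.len()) };
    (via_slice, via_add)
}
```

Both pointers compare equal for any slice, including an empty one, so the change should not alter behavior; the new form simply derives the end pointer from the start pointer without an intermediate slice.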
@@ -0,0 +1,258 @@
+pub const BYTE_FREQUENCIES: [u8; 256] = [
+    55,  // '\x00'
+    52,  // '\x01'
+    51,  // '\x02'
+    50,  // '\x03'
+    49,  // '\x04'
+    48,  // '\x05'
+    47,  // '\x06'
+    46,  // '\x07'
+    45,  // '\x08'
+    103, // '\t'
+    242, // '\n'
+    66,  // '\x0b'
+    67,  // '\x0c'
+    229, // '\r'
+    44,  // '\x0e'
+    43,  // '\x0f'
+    42,  // '\x10'
+    41,  // '\x11'
+    40,  // '\x12'
+    39,  // '\x13'
+    38,  // '\x14'
+    37,  // '\x15'
+    36,  // '\x16'
+    35,  // '\x17'
+    34,  // '\x18'
+    33,  // '\x19'
+    56,  // '\x1a'
+    32,  // '\x1b'
+    31,  // '\x1c'
+    30,  // '\x1d'
+    29,  // '\x1e'
+    28,  // '\x1f'
+    255, // ' '
+    148, // '!'
+    164, // '"'
+    149, // '#'
+    136, // '$'
+    160, // '%'
+    155, // '&'
+    173, // "'"
+    221, // '('
+    222, // ')'
+    134, // '*'
+    122, // '+'
+    232, // ','
+    202, // '-'
+    215, // '.'
+    224, // '/'
+    208, // '0'
+    220, // '1'
+    204, // '2'
+    187, // '3'
+    183, // '4'
+    179, // '5'
+    177, // '6'
+    168, // '7'
+    178, // '8'
+    200, // '9'
+    226, // ':'
+    195, // ';'
+    154, // '<'
+    184, // '='
+    174, // '>'
+    126, // '?'
+    120, // '@'
+    191, // 'A'
+    157, // 'B'
+    194, // 'C'
+    170, // 'D'
+    189, // 'E'
+    162, // 'F'
+    161, // 'G'
+    150, // 'H'
+    193, // 'I'
+    142, // 'J'
+    137, // 'K'
+    171, // 'L'
+    176, // 'M'
+    185, // 'N'
+    167, // 'O'
+    186, // 'P'
+    112, // 'Q'
+    175, // 'R'
+    192, // 'S'
+    188, // 'T'
+    156, // 'U'
+    140, // 'V'
+    143, // 'W'
+    123, // 'X'
+    133, // 'Y'
+    128, // 'Z'
+    147, // '['
+    138, // '\\'
+    146, // ']'
+    114, // '^'
+    223, // '_'
+    151, // '`'
+    249, // 'a'
+    216, // 'b'
+    238, // 'c'
+    236, // 'd'
+    253, // 'e'
+    227, // 'f'
+    218, // 'g'
+    230, // 'h'
+    247, // 'i'
+    135, // 'j'
+    180, // 'k'
+    241, // 'l'
+    233, // 'm'
+    246, // 'n'
+    244, // 'o'
+    231, // 'p'
+    139, // 'q'
+    245, // 'r'
+    243, // 's'
+    251, // 't'
+    235, // 'u'
+    201, // 'v'
+    196, // 'w'
+    240, // 'x'
+    214, // 'y'
+    152, // 'z'
+    182, // '{'
+    205, // '|'
+    181, // '}'
+    127, // '~'
+    27,  // '\x7f'
+    212, // '\x80'
+    211, // '\x81'
+    210, // '\x82'
+    213, // '\x83'
+    228, // '\x84'
+    197, // '\x85'
+    169, // '\x86'
+    159, // '\x87'
+    131, // '\x88'
+    172, // '\x89'
+    105, // '\x8a'
+    80,  // '\x8b'
+    98,  // '\x8c'
+    96,  // '\x8d'
+    97,  // '\x8e'
+    81,  // '\x8f'
+    207, // '\x90'
+    145, // '\x91'
+    116, // '\x92'
+    115, // '\x93'
+    144, // '\x94'
+    130, // '\x95'
+    153, // '\x96'
+    121, // '\x97'
+    107, // '\x98'
+    132, // '\x99'
+    109, // '\x9a'
+    110, // '\x9b'
+    124, // '\x9c'
+    111, // '\x9d'
+    82,  // '\x9e'
+    108, // '\x9f'
+    118, // '\xa0'
+    141, // '¡'
+    113, // '¢'
+    129, // '£'
+    119, // '¤'
+    125, // '¥'
+    165, // '¦'
+    117, // '§'
+    92,  // '¨'
+    106, // '©'
+    83,  // 'ª'
+    72,  // '«'
+    99,  // '¬'
+    93,  // '\xad'
+    65,  // '®'
+    79,  // '¯'
+    166, // '°'
+    237, // '±'
+    163, // '²'
+    199, // '³'
+    190, // '´'
+    225, // 'µ'
+    209, // '¶'
+    203, // '·'
+    198, // '¸'
+    217, // '¹'
+    219, // 'º'
+    206, // '»'
+    234, // '¼'
+    248, // '½'
+    158, // '¾'
+    239, // '¿'
+    255, // 'À'
+    255, // 'Á'
+    255, // 'Â'
+    255, // 'Ã'
+    255, // 'Ä'
+    255, // 'Å'
+    255, // 'Æ'
+    255, // 'Ç'
+    255, // 'È'
+    255, // 'É'
+    255, // 'Ê'
+    255, // 'Ë'
+    255, // 'Ì'
+    255, // 'Í'
+    255, // 'Î'
+    255, // 'Ï'
+    255, // 'Ð'
+    255, // 'Ñ'
+    255, // 'Ò'
+    255, // 'Ó'
+    255, // 'Ô'
+    255, // 'Õ'
+    255, // 'Ö'
+    255, // '×'
+    255, // 'Ø'
+    255, // 'Ù'
+    255, // 'Ú'
+    255, // 'Û'
+    255, // 'Ü'
+    255, // 'Ý'
+    255, // 'Þ'
+    255, // 'ß'
+    255, // 'à'
+    255, // 'á'
+    255, // 'â'
+    255, // 'ã'
+    255, // 'ä'
+    255, // 'å'
+    255, // 'æ'
+    255, // 'ç'
+    255, // 'è'
+    255, // 'é'
+    255, // 'ê'
+    255, // 'ë'
+    255, // 'ì'
+    255, // 'í'
+    255, // 'î'
+    255, // 'ï'
+    255, // 'ð'
+    255, // 'ñ'
+    255, // 'ò'
+    255, // 'ó'
+    255, // 'ô'
+    255, // 'õ'
+    255, // 'ö'
+    255, // '÷'
+    255, // 'ø'
+    255, // 'ù'
+    255, // 'ú'
+    255, // 'û'
+    255, // 'ü'
+    255, // 'ý'
+    255, // 'þ'
+    255, // 'ÿ'
+];
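The table above assigns each byte a rank, where a lower rank means the byte is assumed to be rarer. The sketch below shows one way such a table could be used to pick two rare needle offsets. It is an illustration under stated assumptions: `rank` is a hypothetical stand-in that copies only a handful of entries from the table, and `two_rarest` is not the crate's actual selection routine (that lives in a `rarebytes` module not shown here).

```rust
/// Hypothetical stand-in for indexing a table like `BYTE_FREQUENCIES`.
/// Only a few ranks are copied from the table above; every other byte is
/// treated as maximally common to keep the example short.
fn rank(b: u8) -> u8 {
    match b {
        b' ' => 255,
        b'e' => 253,
        b'a' => 249,
        b'z' => 152,
        b'q' => 139,
        b'Q' => 112,
        _ => 255,
    }
}

/// Returns the offsets of the two lowest-ranked (heuristically rarest)
/// bytes in `needle`, rarest first. Returns None for needles shorter than
/// two bytes. Ties keep the earlier offset.
fn two_rarest(needle: &[u8]) -> Option<(usize, usize)> {
    if needle.len() < 2 {
        return None;
    }
    let (mut r1, mut r2) =
        if rank(needle[0]) <= rank(needle[1]) { (0, 1) } else { (1, 0) };
    for i in 2..needle.len() {
        let r = rank(needle[i]);
        if r < rank(needle[r1]) {
            // New rarest byte: the previous rarest becomes second rarest.
            r2 = r1;
            r1 = i;
        } else if r < rank(needle[r2]) {
            r2 = i;
        }
    }
    Some((r1, r2))
}
```

For the needle `b"aqe z"`, this picks offset 1 (`q`, rank 139) and offset 4 (`z`, rank 152), which are the bytes a search would then look for first.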
@@ -0,0 +1,264 @@
+use core::mem::size_of;
+
+use crate::memmem::{util::memcmp, vector::Vector, NeedleInfo};
+
+/// The minimum length of a needle required for this algorithm. The minimum
+/// is 2 since a length of 1 should just use memchr and a length of 0 isn't
+/// a case handled by this searcher.
+pub(crate) const MIN_NEEDLE_LEN: usize = 2;
+
+/// The maximum length of a needle permitted for this algorithm.
+///
+/// In reality, there is no hard max here. The code below can handle any
+/// length needle. (Perhaps that suggests there are missing optimizations.)
+/// Instead, this is a heuristic and a bound guaranteeing our linear time
+/// complexity.
+///
+/// It is a heuristic because when a candidate match is found, memcmp is run.
+/// For very large needles with lots of false positives, memcmp can make the
+/// code run quite slow.
+///
+/// It is a bound because the worst case behavior with memcmp is multiplicative
+/// in the size of the needle and haystack, and we want to keep that additive.
+/// This bound ensures we still meet that bound theoretically, since it's just
+/// a constant. We aren't acting in bad faith here; memcmp on tiny needles
+/// is so fast that even in pathological cases (see pathological vector
+/// benchmarks), this is still just as fast or faster in practice.
+///
+/// This specific number was chosen by tweaking a bit and running benchmarks.
+/// The rare-medium-needle, for example, gets about 5% faster by using this
+/// algorithm instead of a prefilter-accelerated Two-Way. There's also a
+/// theoretical desire to keep this number reasonably low, to mitigate the
+/// impact of pathological cases. I did try 64, and some benchmarks got a
+/// little better, and others (particularly the pathological ones), got a lot
+/// worse. So... 32 it is?
+pub(crate) const MAX_NEEDLE_LEN: usize = 32;
+
+/// The implementation of the forward vector accelerated substring search.
+///
+/// This is extremely similar to the prefilter vector module by the same name.
+/// The key difference is that this is not a prefilter. Instead, it handles
+/// confirming its own matches. The trade off is that this only works with
+/// smaller needles. The speed up here is that an inlined memcmp on a tiny
+/// needle is very quick, even on pathological inputs. This is much better than
+/// combining a prefilter with Two-Way, where using Two-Way to confirm the
+/// match has higher latency.
+///
+/// So why not use this for all needles? We could, and it would probably work
+/// really well on most inputs. But its worst case is multiplicative and we
+/// want to guarantee worst case additive time. Some of the benchmarks try to
+/// justify this (see the pathological ones).
+///
+/// The prefilter variant of this has more comments. Also note that we only
+/// implement this for forward searches for now. If you have a compelling use
+/// case for accelerated reverse search, please file an issue.
+#[derive(Clone, Copy, Debug)]
+pub(crate) struct Forward {
+    rare1i: u8,
+    rare2i: u8,
+}
+
+impl Forward {
+    /// Create a new "generic simd" forward searcher. If one could not be
+    /// created from the given inputs, then None is returned.
+    pub(crate) fn new(ninfo: &NeedleInfo, needle: &[u8]) -> Option<Forward> {
+        let (rare1i, rare2i) = ninfo.rarebytes.as_rare_ordered_u8();
+        // If the needle is too short or too long, give up. Also, give up
+        // if the rare bytes detected are at the same position. (It likely
+        // suggests a degenerate case, although it should technically not be
+        // possible.)
+        if needle.len() < MIN_NEEDLE_LEN
+            || needle.len() > MAX_NEEDLE_LEN
+            || rare1i == rare2i
+        {
+            return None;
+        }
+        Some(Forward { rare1i, rare2i })
+    }
+
+    /// Returns the minimum length of haystack that is needed for this searcher
+    /// to work for a particular vector. Passing a haystack with a length
+    /// smaller than this will cause `fwd_find` to panic.
+    #[inline(always)]
+    pub(crate) fn min_haystack_len<V: Vector>(&self) -> usize {
+        self.rare2i as usize + size_of::<V>()
+    }
+}
+
+/// Searches the given haystack for the given needle. The needle given should
+/// be the same as the needle that this searcher was initialized with.
+///
+/// # Panics
+///
+/// When the given haystack has a length smaller than `min_haystack_len`.
+///
+/// # Safety
+///
+/// Since this is meant to be used with vector functions, callers need to
+/// specialize this inside of a function with a `target_feature` attribute.
+/// Therefore, callers must ensure that whatever target feature is being used
+/// supports the vector functions that this function is specialized for. (For
+/// the specific vector functions used, see the Vector trait implementations.)
+#[inline(always)]
+pub(crate) unsafe fn fwd_find<V: Vector>(
+    fwd: &Forward,
+    haystack: &[u8],
+    needle: &[u8],
+) -> Option<usize> {
+    // It would be nice if we didn't have this check here, since the meta
+    // searcher should handle it for us. But without this, I don't think we
+    // guarantee that end_ptr.sub(needle.len()) won't result in UB. We could
+    // put it as part of the safety contract, but it makes it more complicated
+    // than necessary.
+    if haystack.len() < needle.len() {
+        return None;
+    }
+    let min_haystack_len = fwd.min_haystack_len::<V>();
+    assert!(haystack.len() >= min_haystack_len, "haystack too small");
+    debug_assert!(needle.len() <= haystack.len());
+    debug_assert!(
+        needle.len() >= MIN_NEEDLE_LEN,
+        "needle must be at least {} bytes",
+        MIN_NEEDLE_LEN,
+    );
+    debug_assert!(
+        needle.len() <= MAX_NEEDLE_LEN,
+        "needle must be at most {} bytes",
+        MAX_NEEDLE_LEN,
+    );
+
+    let (rare1i, rare2i) = (fwd.rare1i as usize, fwd.rare2i as usize);
+    let rare1chunk = V::splat(needle[rare1i]);
+    let rare2chunk = V::splat(needle[rare2i]);
+
+    let start_ptr = haystack.as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
+    let max_ptr = end_ptr.sub(min_haystack_len);
+    let mut ptr = start_ptr;
+
+    // N.B. I did experiment with unrolling the loop to deal with size(V)
+    // bytes at a time and 2*size(V) bytes at a time. The double unroll was
+    // marginally faster while the quadruple unroll was unambiguously slower.
+    // In the end, I decided the complexity from unrolling wasn't worth it. I
+    // used the memmem/krate/prebuilt/huge-en/ benchmarks to compare.
+    while ptr <= max_ptr {
+        let m = fwd_find_in_chunk(
+            fwd, needle, ptr, end_ptr, rare1chunk, rare2chunk, !0,
+        );
+        if let Some(chunki) = m {
+            return Some(matched(start_ptr, ptr, chunki));
+        }
+        ptr = ptr.add(size_of::<V>());
+    }
+    if ptr < end_ptr {
+        let remaining = diff(end_ptr, ptr);
+        debug_assert!(
+            remaining < min_haystack_len,
+            "remaining bytes should be smaller than the minimum haystack \
+             length of {}, but there are {} bytes remaining",
+            min_haystack_len,
+            remaining,
+        );
+        if remaining < needle.len() {
+            return None;
+        }
+        debug_assert!(
+            max_ptr < ptr,
+            "after main loop, ptr should have exceeded max_ptr",
+        );
+        let overlap = diff(ptr, max_ptr);
+        debug_assert!(
+            overlap > 0,
+            "overlap ({}) must always be non-zero",
+            overlap,
+        );
+        debug_assert!(
+            overlap < size_of::<V>(),
+            "overlap ({}) cannot possibly be >= than a vector ({})",
+            overlap,
+            size_of::<V>(),
+        );
+        // The mask has all of its bits set except for the first N least
+        // significant bits, where N=overlap. This way, any matches that
+        // occur in find_in_chunk within the overlap are automatically
+        // ignored.
+        let mask = !((1 << overlap) - 1);
+        ptr = max_ptr;
+        let m = fwd_find_in_chunk(
+            fwd, needle, ptr, end_ptr, rare1chunk, rare2chunk, mask,
+        );
+        if let Some(chunki) = m {
+            return Some(matched(start_ptr, ptr, chunki));
+        }
+    }
+    None
+}
+
+/// Search for an occurrence of two rare bytes from the needle in the chunk
+/// pointed to by ptr, with the end of the haystack pointed to by end_ptr.
+///
+/// rare1chunk and rare2chunk correspond to vectors with the rare1 and rare2
+/// bytes repeated in each 8-bit lane, respectively.
+///
+/// mask should have bits set corresponding to the positions in the chunk in
+/// which matches are considered. This is only used for the last vector load
+/// where the beginning of the vector might have overlapped with the last load
+/// in the main loop. The mask lets us avoid visiting positions that have
+/// already been discarded as matches.
+///
+/// # Safety
+///
+/// It must be safe to do an unaligned read of size(V) bytes starting at both
+/// (ptr + rare1i) and (ptr + rare2i). It must also be safe to do unaligned
+/// loads on ptr up to end_ptr.
+#[inline(always)]
+unsafe fn fwd_find_in_chunk<V: Vector>(
+    fwd: &Forward,
+    needle: &[u8],
+    ptr: *const u8,
+    end_ptr: *const u8,
+    rare1chunk: V,
+    rare2chunk: V,
+    mask: u32,
+) -> Option<usize> {
+    let chunk0 = V::load_unaligned(ptr.add(fwd.rare1i as usize));
+    let chunk1 = V::load_unaligned(ptr.add(fwd.rare2i as usize));
+
+    let eq0 = chunk0.cmpeq(rare1chunk);
+    let eq1 = chunk1.cmpeq(rare2chunk);
+
+    let mut match_offsets = eq0.and(eq1).movemask() & mask;
+    while match_offsets != 0 {
+        let offset = match_offsets.trailing_zeros() as usize;
+        let ptr = ptr.add(offset);
+        if end_ptr.sub(needle.len()) < ptr {
+            return None;
+        }
+        let chunk = core::slice::from_raw_parts(ptr, needle.len());
+        if memcmp(needle, chunk) {
+            return Some(offset);
+        }
+        match_offsets &= match_offsets - 1;
+    }
+    None
+}
+
+/// Accepts a chunk-relative offset and returns a haystack relative offset.
+///
+/// See the function with the same name in the prefilter variant of this
+/// algorithm to learn why it's tagged with inline(never). Even here, where
+/// the function is simpler, inlining it leads to poorer codegen. (Although
+/// it does improve some benchmarks, like prebuiltiter/huge-en/common-you.)
+#[cold]
+#[inline(never)]
+fn matched(start_ptr: *const u8, ptr: *const u8, chunki: usize) -> usize {
+    diff(ptr, start_ptr) + chunki
+}
+
+/// Subtract `b` from `a` and return the difference. `a` must be greater than
+/// or equal to `b`.
+fn diff(a: *const u8, b: *const u8) -> usize {
+    debug_assert!(a >= b);
+    (a as usize) - (b as usize)
+}
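Two small bit tricks carry much of the logic in the searcher above: the mask that hides already-searched positions in the final overlapping chunk, and the loop that walks each set bit of the movemask result. The sketch below isolates both as plain integer functions. The names `overlap_mask` and `candidate_offsets` are hypothetical, and collecting offsets into a `Vec` is only for illustration; the real code confirms each candidate with memcmp as it goes.

```rust
/// Builds the mask used for the final, overlapping chunk: every bit is set
/// except the `overlap` least significant ones. Assumes `overlap < 32`.
fn overlap_mask(overlap: u32) -> u32 {
    !((1u32 << overlap) - 1)
}

/// Lists the candidate offsets encoded in a movemask-style bitset, lowest
/// offset first.
fn candidate_offsets(mut match_offsets: u32) -> Vec<usize> {
    let mut out = Vec::new();
    while match_offsets != 0 {
        // The index of the lowest set bit is the next candidate offset.
        out.push(match_offsets.trailing_zeros() as usize);
        // Clear the lowest set bit and continue with the rest.
        match_offsets &= match_offsets - 1;
    }
    out
}
```

With an overlap of 3, a bitset with candidates at offsets 1, 2, 4 and 7 is reduced to offsets 4 and 7, since the first three positions were already covered by the previous load.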
@@ -0,0 +1,122 @@
+/*
+This module implements a "fallback" prefilter that only relies on memchr to
+function. While memchr works best when it's explicitly vectorized, its
+fallback implementations are fast enough to make a prefilter like this
+worthwhile.
+
+The essence of this implementation is to identify two rare bytes in a needle
+based on a background frequency distribution of bytes. We then run memchr on the
+rarer byte. For each match, we use the second rare byte as a guard to quickly
+check if a match is possible. If the position passes the guard test, then we do
+a naive memcmp to confirm the match.
+
+In practice, this formulation works amazingly well, primarily because of the
+heuristic use of a background frequency distribution. However, it does have a
+number of weaknesses where it can get quite slow when its background frequency
+distribution doesn't line up with the haystack being searched. This is why we
+have specialized vector routines that essentially take this idea and move the
+guard check into vectorized code. (Those specialized vector routines do still
+make use of the background frequency distribution of bytes though.)
+
+This fallback implementation was originally formulated in regex many moons ago:
+https://github.com/rust-lang/regex/blob/3db8722d0b204a85380fe2a65e13d7065d7dd968/src/literal/imp.rs#L370-L501
+Prior to that, I'm not aware of anyone using this technique in any prominent
+substring search implementation. Although, I'm sure folks have had this same
+insight long before me.
+
+Another version of this also appeared in bstr:
+https://github.com/BurntSushi/bstr/blob/a444256ca7407fe180ee32534688549655b7a38e/src/search/prefilter.rs#L83-L340
+*/
+
+use crate::memmem::{
+    prefilter::{PrefilterFnTy, PrefilterState},
+    NeedleInfo,
+};
+
+// Check that the functions below satisfy the Prefilter function type.
+const _: PrefilterFnTy = find;
+
+/// Look for a possible occurrence of needle. The position returned
+/// corresponds to the beginning of the occurrence, if one exists.
+///
+/// Callers may assume that this never returns false negatives (i.e., it
+/// never misses an actual occurrence), but must check that the returned
+/// position corresponds to a match. That is, it can return false
+/// positives.
+///
+/// This should only be used when Freqy is constructed for forward
+/// searching.
+pub(crate) fn find(
+    prestate: &mut PrefilterState,
+    ninfo: &NeedleInfo,
+    haystack: &[u8],
+    needle: &[u8],
+) -> Option<usize> {
+    let mut i = 0;
+    let (rare1i, rare2i) = ninfo.rarebytes.as_rare_usize();
+    let (rare1, rare2) = ninfo.rarebytes.as_rare_bytes(needle);
+    while prestate.is_effective() {
+        // Use a fast vectorized implementation to skip to the next
+        // occurrence of the rarest byte (heuristically chosen) in the
+        // needle.
+        let found = crate::memchr(rare1, &haystack[i..])?;
+        prestate.update(found);
+        i += found;
+
+        // If we can't align our first match with the haystack, then a
+        // match is impossible.
+        if i < rare1i {
+            i += 1;
+            continue;
+        }
+
+        // Align our rare2 byte with the haystack. A mismatch means that
+        // a match is impossible.
+        let aligned_rare2i = i - rare1i + rare2i;
+        if haystack.get(aligned_rare2i) != Some(&rare2) {
+            i += 1;
+            continue;
+        }
+
+        // We've done what we can. There might be a match here.
+        return Some(i - rare1i);
+    }
+    // The only way we get here is if we believe our skipping heuristic
+    // has become ineffective. We're allowed to return false positives,
+    // so return the position at which we advanced to, aligned to the
+    // haystack.
+    Some(i.saturating_sub(rare1i))
+}
+
+#[cfg(all(test, feature = "std"))]
+mod tests {
+    use super::*;
+
+    fn freqy_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
+        let ninfo = NeedleInfo::new(needle);
+        let mut prestate = PrefilterState::new();
+        find(&mut prestate, &ninfo, haystack, needle)
+    }
+
+    #[test]
+    fn freqy_forward() {
+        assert_eq!(Some(0), freqy_find(b"BARFOO", b"BAR"));
+        assert_eq!(Some(3), freqy_find(b"FOOBAR", b"BAR"));
+        assert_eq!(Some(0), freqy_find(b"zyzz", b"zyzy"));
+        assert_eq!(Some(2), freqy_find(b"zzzy", b"zyzy"));
+        assert_eq!(None, freqy_find(b"zazb", b"zyzy"));
+        assert_eq!(Some(0), freqy_find(b"yzyy", b"yzyz"));
+        assert_eq!(Some(2), freqy_find(b"yyyz", b"yzyz"));
+        assert_eq!(None, freqy_find(b"yayb", b"yzyz"));
+    }
+
+    #[test]
+    #[cfg(not(miri))]
+    fn prefilter_permutations() {
+        use crate::memmem::prefilter::tests::PrefilterTest;
+
+        // SAFETY: super::find is safe to call for all inputs and on all
+        // platforms.
+        unsafe { PrefilterTest::run_all_tests(super::find) };
+    }
+}
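The fallback prefilter above can be approximated with the standard library alone. The sketch below is a simplification under stated assumptions: the function name `candidate` is hypothetical, the rare byte offsets are passed in directly rather than read from `NeedleInfo`, `Iterator::position` stands in for `memchr`, and the `PrefilterState` effectiveness tracking is omitted entirely.

```rust
/// Returns the start of a possible occurrence of `needle` in `haystack`,
/// or None if no occurrence is possible. May return false positives.
///
/// `rare1i` and `rare2i` are offsets into `needle` of two bytes assumed to
/// be rare; both must be valid indices into `needle`.
fn candidate(
    haystack: &[u8],
    needle: &[u8],
    rare1i: usize,
    rare2i: usize,
) -> Option<usize> {
    let (rare1, rare2) = (needle[rare1i], needle[rare2i]);
    let mut i = 0;
    while i < haystack.len() {
        // Skip ahead to the next occurrence of the rarest byte.
        i += haystack[i..].iter().position(|&b| b == rare1)?;
        // The needle can only start at or after offset 0 of the haystack.
        if i >= rare1i {
            // Guard: the second rare byte must line up as well.
            let aligned_rare2i = i - rare1i + rare2i;
            if haystack.get(aligned_rare2i) == Some(&rare2) {
                return Some(i - rare1i);
            }
        }
        i += 1;
    }
    None
}
```

Note that `candidate(b"xBxR", b"BAR", 0, 2)` reports offset 1 even though `BAR` does not occur there. That is the expected false positive: the caller is responsible for confirming the match.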
@@ -0,0 +1,207 @@
+use core::mem::size_of;
+
+use crate::memmem::{
+    prefilter::{PrefilterFnTy, PrefilterState},
+    vector::Vector,
+    NeedleInfo,
+};
+
+/// The implementation of the forward vector accelerated candidate finder.
+///
+/// This is inspired by the "generic SIMD" algorithm described here:
+/// http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd
+///
+/// The main difference is that this is just a prefilter. That is, it reports
+/// candidates once they are seen and doesn't attempt to confirm them. Also,
+/// the bytes this routine uses to check for candidates are selected based on
+/// an a priori background frequency distribution. This means that on most
+/// haystacks, this will on average spend more time in vectorized code than you
+/// would if you just selected the first and last bytes of the needle.
+///
+/// Note that a non-prefilter variant of this algorithm can be found in the
+/// parent module, but it only works on smaller needles.
+///
+/// `prestate`, `ninfo`, `haystack` and `needle` are the four prefilter
+/// function parameters. `fallback` is a prefilter that is used if the haystack
+/// is too small to be handled with the given vector size.
+///
+/// This routine is not safe because it is intended for callers to specialize
+/// this with a particular vector (e.g., __m256i) and then call it with the
+/// relevant target feature (e.g., avx2) enabled.
+///
+/// # Panics
+///
+/// If `needle.len() <= 1`, then this panics.
+///
+/// # Safety
+///
+/// Since this is meant to be used with vector functions, callers need to
+/// specialize this inside of a function with a `target_feature` attribute.
+/// Therefore, callers must ensure that whatever target feature is being used
+/// supports the vector functions that this function is specialized for. (For
+/// the specific vector functions used, see the Vector trait implementations.)
+#[inline(always)]
+pub(crate) unsafe fn find<V: Vector>(
+    prestate: &mut PrefilterState,
+    ninfo: &NeedleInfo,
+    haystack: &[u8],
+    needle: &[u8],
+    fallback: PrefilterFnTy,
+) -> Option<usize> {
+    assert!(needle.len() >= 2, "needle must be at least 2 bytes");
+    let (rare1i, rare2i) = ninfo.rarebytes.as_rare_ordered_usize();
+    let min_haystack_len = rare2i + size_of::<V>();
+    if haystack.len() < min_haystack_len {
+        return fallback(prestate, ninfo, haystack, needle);
+    }
+
+    let start_ptr = haystack.as_ptr();
+    let end_ptr = start_ptr.add(haystack.len());
+    let max_ptr = end_ptr.sub(min_haystack_len);
+    let mut ptr = start_ptr;
+
+    let rare1chunk = V::splat(needle[rare1i]);
+    let rare2chunk = V::splat(needle[rare2i]);
+
+    // N.B. I did experiment with unrolling the loop to deal with size(V)
+    // bytes at a time and 2*size(V) bytes at a time. The double unroll
+    // was marginally faster while the quadruple unroll was unambiguously
+    // slower. In the end, I decided the complexity from unrolling wasn't
+    // worth it. I used the memmem/krate/prebuilt/huge-en/ benchmarks to
+    // compare.
+    while ptr <= max_ptr {
+        let m = find_in_chunk2(ptr, rare1i, rare2i, rare1chunk, rare2chunk);
+        if let Some(chunki) = m {
+            return Some(matched(prestate, start_ptr, ptr, chunki));
+        }
+        ptr = ptr.add(size_of::<V>());
+    }
+    if ptr < end_ptr {
+        // This routine immediately quits if a candidate match is found.
+        // That means that if we're here, no candidate matches have been
+        // found at or before 'ptr'. Thus, we don't need to mask anything
+        // out even though we might technically search part of the haystack
+        // that we've already searched (because we know it can't match).
+        ptr = max_ptr;
+        let m = find_in_chunk2(ptr, rare1i, rare2i, rare1chunk, rare2chunk);
+        if let Some(chunki) = m {
+            return Some(matched(prestate, start_ptr, ptr, chunki));
+        }
+    }
+    prestate.update(haystack.len());
+    None
+}
+
+// Below are two different techniques for checking whether a candidate
+// match exists in a given chunk or not. find_in_chunk2 checks two bytes
+// whereas find_in_chunk3 checks three bytes. The idea behind checking
+// three bytes is that while we do a bit more work per iteration, we
+// decrease the chances of a false positive match being reported and thus
+// make the search faster overall. This actually works out for the
+// memmem/krate/prebuilt/huge-en/never-all-common-bytes benchmark, where
+// using find_in_chunk3 is about 25% faster than find_in_chunk2. However,
+// it turns out that find_in_chunk2 is faster for all other benchmarks, so
+// perhaps the extra check isn't worth it in practice.
+//
+// For now, we go with find_in_chunk2, but we leave find_in_chunk3 around
+// to make it easy to switch to and benchmark when possible.
+
+/// Search for an occurrence of two rare bytes from the needle in the current
+/// chunk pointed to by ptr.
+///
+/// rare1chunk and rare2chunk correspond to vectors with the rare1 and rare2
+/// bytes repeated in each 8-bit lane, respectively.
+///
+/// # Safety
+///
+/// It must be safe to do an unaligned read of size(V) bytes starting at both
+/// (ptr + rare1i) and (ptr + rare2i).
+#[inline(always)]
+unsafe fn find_in_chunk2<V: Vector>(
+    ptr: *const u8,
+    rare1i: usize,
+    rare2i: usize,
+    rare1chunk: V,
+    rare2chunk: V,
+) -> Option<usize> {
+    let chunk0 = V::load_unaligned(ptr.add(rare1i));
+    let chunk1 = V::load_unaligned(ptr.add(rare2i));
+
+    let eq0 = chunk0.cmpeq(rare1chunk);
+    let eq1 = chunk1.cmpeq(rare2chunk);
+
+    let match_offsets = eq0.and(eq1).movemask();
+    if match_offsets == 0 {
+        return None;
+    }
+    Some(match_offsets.trailing_zeros() as usize)
+}
+
+/// Search for an occurrence of two rare bytes and the first byte (even if one
+/// of the rare bytes is equivalent to the first byte) from the needle in the
+/// current chunk pointed to by ptr.
+///
+/// firstchunk, rare1chunk and rare2chunk correspond to vectors with the first,
+/// rare1 and rare2 bytes repeated in each 8-bit lane, respectively.
+///
+/// # Safety
+///
+/// It must be safe to do an unaligned read of size(V) bytes starting at ptr,
+/// (ptr + rare1i) and (ptr + rare2i).
+#[allow(dead_code)]
+#[inline(always)]
+unsafe fn find_in_chunk3<V: Vector>(
+    ptr: *const u8,
+    rare1i: usize,
+    rare2i: usize,
+    firstchunk: V,
+    rare1chunk: V,
+    rare2chunk: V,
+) -> Option<usize> {
+    let chunk0 = V::load_unaligned(ptr);
+    let chunk1 = V::load_unaligned(ptr.add(rare1i));
+    let chunk2 = V::load_unaligned(ptr.add(rare2i));
+
+    let eq0 = chunk0.cmpeq(firstchunk);
+    let eq1 = chunk1.cmpeq(rare1chunk);
+    let eq2 = chunk2.cmpeq(rare2chunk);
+
+    let match_offsets = eq0.and(eq1).and(eq2).movemask();
+    if match_offsets == 0 {
+        return None;
+    }
+    Some(match_offsets.trailing_zeros() as usize)
+}
+
+/// Accepts a chunk-relative offset and returns a haystack relative offset
+/// after updating the prefilter state.
+///
+/// Why do we use this uninlineable function when a search completes? Well,
+/// I don't know. Really. Obviously this function was not here initially.
+/// When doing profiling, the codegen for the inner loop here looked bad and
+/// I didn't know why. There were a couple extra 'add' instructions and an
+/// extra 'lea' instruction that I couldn't explain. I hypothesized that the
+/// optimizer was having trouble untangling the hot code in the loop from the
+/// code that deals with a candidate match. By putting the latter into an
+/// uninlineable function, it kind of forces the issue and it had the intended
+/// effect: codegen improved measurably. It's good for a ~10% improvement
+/// across the board on the memmem/krate/prebuilt/huge-en/ benchmarks.
+#[cold]
+#[inline(never)]
+fn matched(
+    prestate: &mut PrefilterState,
+    start_ptr: *const u8,
+    ptr: *const u8,
+    chunki: usize,
+) -> usize {
+    let found = diff(ptr, start_ptr) + chunki;
+    prestate.update(found);
+    found
+}
+
+/// Subtract `b` from `a` and return the difference. `a` must be greater than
+/// or equal to `b`.
+fn diff(a: *const u8, b: *const u8) -> usize {
+    debug_assert!(a >= b);
+    (a as usize) - (b as usize)
+}
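To make the two-load comparison in `find_in_chunk2` easier to follow, here is a scalar model of it. This is a sketch, not the vectorized code: `LANES` is an assumed vector width of 16 bytes, the name `find_in_chunk2_scalar` is hypothetical, and slice indexing replaces the unaligned vector loads, so it panics rather than invoking undefined behavior if the chunk runs past the haystack.

```rust
/// Assumed vector width in bytes for this illustration (e.g. a 128-bit
/// vector).
const LANES: usize = 16;

/// Scalar model of the two-byte candidate check. Lane `j` matches when the
/// byte at `pos + rare1i + j` equals `rare1` and the byte at
/// `pos + rare2i + j` equals `rare2`. Returns the lowest matching lane.
///
/// Callers must ensure `pos + max(rare1i, rare2i) + LANES` does not exceed
/// `haystack.len()`, otherwise indexing panics.
fn find_in_chunk2_scalar(
    haystack: &[u8],
    pos: usize,
    rare1i: usize,
    rare2i: usize,
    rare1: u8,
    rare2: u8,
) -> Option<usize> {
    let mut match_offsets: u32 = 0;
    for lane in 0..LANES {
        // Equivalent of cmpeq on each load, followed by `and`.
        let eq0 = haystack[pos + rare1i + lane] == rare1;
        let eq1 = haystack[pos + rare2i + lane] == rare2;
        if eq0 && eq1 {
            // Equivalent of movemask: one bit per matching lane.
            match_offsets |= 1u32 << lane;
        }
    }
    if match_offsets == 0 {
        return None;
    }
    Some(match_offsets.trailing_zeros() as usize)
}
```

The returned lane index is relative to `pos`, which mirrors how the vector version returns a chunk-relative offset that `matched` then converts to a haystack-relative one.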
@@ -0,0 +1,562 @@
+use crate::memmem::{rarebytes::RareNeedleBytes, NeedleInfo};
+
+mod fallback;
+#[cfg(all(target_arch = "x86_64", memchr_runtime_simd))]
+mod genericsimd;
+#[cfg(all(not(miri), target_arch = "x86_64", memchr_runtime_simd))]
+mod x86;
+
+/// The maximum frequency rank permitted for the fallback prefilter. If the
+/// rarest byte in the needle has a frequency rank above this value, then no
+/// prefilter is used if the fallback prefilter would otherwise be selected.
+const MAX_FALLBACK_RANK: usize = 250;
+
+/// A combination of prefilter effectiveness state, the prefilter function and
+/// the needle info required to run a prefilter.
+///
+/// For the most part, these are grouped into a single type for convenience,
+/// instead of needing to pass around all three as distinct function
+/// parameters.
+pub(crate) struct Pre<'a> {
+    /// State that tracks the effectiveness of a prefilter.
+    pub(crate) state: &'a mut PrefilterState,
+    /// The actual prefilter function.
+    pub(crate) prefn: PrefilterFn,
+    /// Information about a needle, such as its RK hash and rare byte offsets.
+    pub(crate) ninfo: &'a NeedleInfo,
+}
+
+impl<'a> Pre<'a> {
+    /// Call this prefilter on the given haystack with the given needle.
+    #[inline(always)]
+    pub(crate) fn call(
+        &mut self,
+        haystack: &[u8],
+        needle: &[u8],
+    ) -> Option<usize> {
+        self.prefn.call(self.state, self.ninfo, haystack, needle)
+    }
+
+    /// Return true if and only if this prefilter should be used.
+    #[inline(always)]
+    pub(crate) fn should_call(&mut self) -> bool {
+        self.state.is_effective()
+    }
+}
+
+/// A prefilter function.
+///
+/// A prefilter function describes both forward and reverse searches.
+/// (Although, we don't currently implement prefilters for reverse searching.)
+/// In the case of a forward search, the position returned corresponds to
+/// the starting offset of a match (confirmed or possible). Its minimum
+/// value is `0`, and its maximum value is `haystack.len() - 1`. In the case
+/// of a reverse search, the position returned corresponds to the position
+/// immediately after a match (confirmed or possible). Its minimum value is `1`
+/// and its maximum value is `haystack.len()`.
+///
+/// In both cases, the position returned is the starting (or ending) point of a
+/// _possible_ match. That is, returning a false positive is okay. A prefilter,
+/// however, must never return any false negatives. That is, if a match exists
+/// at a particular position `i`, then a prefilter _must_ return that position.
+/// It cannot skip past it.
+///
+/// # Safety
+///
+/// A prefilter function is not safe to create, since not all prefilters are
+/// safe to call in all contexts. (e.g., A prefilter that uses AVX instructions
+/// may only be called on x86_64 CPUs with the relevant AVX feature enabled.)
+/// Thus, callers must ensure that when a prefilter function is created that it
+/// is safe to call for the current environment.
+#[derive(Clone, Copy)]
+pub(crate) struct PrefilterFn(PrefilterFnTy);
+
+/// The type of a prefilter function. All prefilters must satisfy this
+/// signature.
+///
+/// Using a function pointer like this does inhibit inlining, but it does
+/// eliminate branching and the extra costs associated with copying a larger
+/// enum. Note also, that using Box<dyn SomePrefilterTrait> can't really work
+/// here, since we want to work in contexts that don't have dynamic memory
+/// allocation. Moreover, in the default configuration of this crate on x86_64
+/// CPUs released in the past ~decade, we will use an AVX2-optimized prefilter,
+/// which generally won't be inlineable into the surrounding code anyway.
+/// (Unless AVX2 is enabled at compile time, but this is typically rare, since
+/// it produces a non-portable binary.)
+pub(crate) type PrefilterFnTy = unsafe fn(
+    prestate: &mut PrefilterState,
+    ninfo: &NeedleInfo,
+    haystack: &[u8],
+    needle: &[u8],
+) -> Option<usize>;
+
+impl PrefilterFn {
+    /// Create a new prefilter function from the function pointer given.
+    ///
+    /// # Safety
+    ///
+    /// Callers must ensure that the given prefilter function is safe to call
+    /// for all inputs in the current environment. For example, if the given
+    /// prefilter function uses AVX instructions, then the caller must ensure
+    /// that the appropriate AVX CPU features are enabled.
+    pub(crate) unsafe fn new(prefn: PrefilterFnTy) -> PrefilterFn {
+        PrefilterFn(prefn)
+    }
+
+    /// Call the underlying prefilter function with the given arguments.
+    pub fn call(
+        self,
+        prestate: &mut PrefilterState,
+        ninfo: &NeedleInfo,
+        haystack: &[u8],
+        needle: &[u8],
+    ) -> Option<usize> {
+        // SAFETY: Callers have the burden of ensuring that a prefilter
+        // function is safe to call for all inputs in the current environment.
+        unsafe { (self.0)(prestate, ninfo, haystack, needle) }
+    }
+}
+
+impl core::fmt::Debug for PrefilterFn {
+    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
+        "<prefilter-fn(...)>".fmt(f)
+    }
+}
+
+/// Prefilter controls whether heuristics are used to accelerate searching.
+///
+/// A prefilter refers to the idea of detecting candidate matches very quickly,
+/// and then confirming whether those candidates are full matches. This
+/// idea can be quite effective since it's often the case that looking for
+/// candidates can be a lot faster than running a complete substring search
+/// over the entire input. Namely, looking for candidates can be done with
+/// extremely fast vectorized code.
+///
+/// The downside of a prefilter is that it assumes false positives (which are
+/// candidates generated by a prefilter that aren't matches) are somewhat rare
+/// relative to the frequency of full matches. That is, if a lot of false
+/// positives are generated, then it's possible for search time to be worse
+/// than if the prefilter wasn't enabled in the first place.
+///
+/// Another downside of a prefilter is that it can result in highly variable
+/// performance, where some cases are extraordinarily fast and others aren't.
+/// Typically, variable performance isn't a problem, but it may be for your use
+/// case.
+///
+/// This implementation uses a heuristic to detect when a prefilter might not
+/// be carrying its weight, and will dynamically disable it. Nevertheless,
+/// this configuration option gives callers the ability to disable prefilters
+/// if they know that prefilters won't be useful.
+#[derive(Clone, Copy, Debug)]
+#[non_exhaustive]
+pub enum Prefilter {
+    /// Never use a prefilter in substring search.
+    None,
+    /// Automatically detect whether a heuristic prefilter should be used. If
+    /// it is used, then heuristics will be used to dynamically disable the
+    /// prefilter if it is believed to not be carrying its weight.
+    Auto,
+}
+
+impl Default for Prefilter {
+    fn default() -> Prefilter {
+        Prefilter::Auto
+    }
+}
+
+impl Prefilter {
+    pub(crate) fn is_none(&self) -> bool {
+        match *self {
+            Prefilter::None => true,
+            _ => false,
+        }
+    }
+}
+
+/// PrefilterState tracks state associated with the effectiveness of a
+/// prefilter. It is used to track how many bytes, on average, are skipped by
+/// the prefilter. If this average dips below a certain threshold over time,
+/// then the state renders the prefilter inert and stops using it.
+///
+/// A prefilter state should be created for each search. (Where creating an
+/// iterator is treated as a single search.) A prefilter state should only be
+/// created from a `Freqy`. e.g., An inert `Freqy` will produce an inert
+/// `PrefilterState`.
+#[derive(Clone, Debug)]
+pub(crate) struct PrefilterState {
+    /// The number of skips that has been executed. This is always 1 greater
+    /// than the actual number of skips. The special sentinel value of 0
+    /// indicates that the prefilter is inert. This is useful to avoid
+    /// additional checks to determine whether the prefilter is still
+    /// "effective." Once a prefilter becomes inert, it should no longer be
+    /// used (according to our heuristics).
+    skips: u32,
+    /// The total number of bytes that have been skipped.
+    skipped: u32,
+}
+
+impl PrefilterState {
+    /// The minimum number of skip attempts to try before considering whether
+    /// a prefilter is effective or not.
+    const MIN_SKIPS: u32 = 50;
+
+    /// The minimum amount of bytes that skipping must average.
+    ///
+    /// This value was chosen based on varying it and checking
+    /// the microbenchmarks. In particular, this can impact the
+    /// pathological/repeated-{huge,small} benchmarks quite a bit if it's set
+    /// too low.
+    const MIN_SKIP_BYTES: u32 = 8;
+
+    /// Create a fresh prefilter state.
+    pub(crate) fn new() -> PrefilterState {
+        PrefilterState { skips: 1, skipped: 0 }
+    }
+
+    /// Create a fresh prefilter state that is always inert.
+    pub(crate) fn inert() -> PrefilterState {
+        PrefilterState { skips: 0, skipped: 0 }
+    }
+
+    /// Update this state with the number of bytes skipped on the last
+    /// invocation of the prefilter.
+    #[inline]
+    pub(crate) fn update(&mut self, skipped: usize) {
+        self.skips = self.skips.saturating_add(1);
+        // We need to do this dance since it's technically possible for
+        // `skipped` to overflow a `u32`. (And we use a `u32` to reduce the
+        // size of a prefilter state.)
+        if skipped > core::u32::MAX as usize {
+            self.skipped = core::u32::MAX;
+        } else {
+            self.skipped = self.skipped.saturating_add(skipped as u32);
+        }
+    }
+
+    /// Return true if and only if this state indicates that a prefilter is
+    /// still effective.
+    #[inline]
+    pub(crate) fn is_effective(&mut self) -> bool {
+        if self.is_inert() {
+            return false;
+        }
+        if self.skips() < PrefilterState::MIN_SKIPS {
+            return true;
+        }
+        if self.skipped >= PrefilterState::MIN_SKIP_BYTES * self.skips() {
+            return true;
+        }
+
+        // We're inert.
+        self.skips = 0;
+        false
+    }
+
+    #[inline]
+    fn is_inert(&self) -> bool {
+        self.skips == 0
+    }
+
+    #[inline]
+    fn skips(&self) -> u32 {
+        self.skips.saturating_sub(1)
+    }
+}
+
+/// Determine which prefilter function, if any, to use.
+///
+/// This only applies to x86_64 when runtime SIMD detection is enabled (which
+/// is the default). In general, we try to use an AVX prefilter, followed by
+/// SSE and then followed by a generic one based on memchr.
+#[cfg(all(not(miri), target_arch = "x86_64", memchr_runtime_simd))]
+#[inline(always)]
+pub(crate) fn forward(
+    config: &Prefilter,
+    rare: &RareNeedleBytes,
+    needle: &[u8],
+) -> Option<PrefilterFn> {
+    if config.is_none() || needle.len() <= 1 {
+        return None;
+    }
+
+    #[cfg(feature = "std")]
+    {
+        if cfg!(memchr_runtime_avx) {
+            if is_x86_feature_detected!("avx2") {
+                // SAFETY: x86::avx::find only requires the avx2 feature,
+                // which we've just checked above.
+                return unsafe { Some(PrefilterFn::new(x86::avx::find)) };
+            }
+        }
+    }
+    if cfg!(memchr_runtime_sse2) {
+        // SAFETY: x86::sse::find only requires the sse2 feature, which is
+        // guaranteed to be available on x86_64.
+        return unsafe { Some(PrefilterFn::new(x86::sse::find)) };
+    }
+    // Check that our rarest byte has a reasonably low rank. The main issue
+    // here is that the fallback prefilter can perform pretty poorly if it's
+    // given common bytes. So we try to avoid the worst cases here.
+    let (rare1_rank, _) = rare.as_ranks(needle);
+    if rare1_rank <= MAX_FALLBACK_RANK {
+        // SAFETY: fallback::find is safe to call in all environments.
+        return unsafe { Some(PrefilterFn::new(fallback::find)) };
+    }
+    None
+}
+
+/// Determine which prefilter function, if any, to use.
+///
+/// Since SIMD is currently only supported on x86_64, this will just select
+/// the fallback prefilter if the rare bytes provided have a low enough rank.
+#[cfg(not(all(not(miri), target_arch = "x86_64", memchr_runtime_simd)))]
+#[inline(always)]
+pub(crate) fn forward(
+    config: &Prefilter,
+    rare: &RareNeedleBytes,
+    needle: &[u8],
+) -> Option<PrefilterFn> {
+    if config.is_none() || needle.len() <= 1 {
+        return None;
+    }
+    let (rare1_rank, _) = rare.as_ranks(needle);
+    if rare1_rank <= MAX_FALLBACK_RANK {
+        // SAFETY: fallback::find is safe to call in all environments.
+        return unsafe { Some(PrefilterFn::new(fallback::find)) };
+    }
+    None
+}
+
+/// Return the minimum length of the haystack in which a prefilter should be
+/// used. If the haystack is below this length, then it's probably not worth
+/// the overhead of running the prefilter.
+///
+/// We used to look at the length of a haystack here. That is, if it was too
+/// small, then don't bother with the prefilter. But two things changed:
+/// the prefilter falls back to memchr for small haystacks, and, at the
+/// meta-searcher level, Rabin-Karp is employed for tiny haystacks anyway.
+///
+/// We keep it around for now in case we want to bring it back.
+#[allow(dead_code)]
+pub(crate) fn minimum_len(_haystack: &[u8], needle: &[u8]) -> usize {
+    // If the haystack length isn't greater than needle.len() * FACTOR, then
+    // no prefilter will be used. The presumption here is that since there
+    // are so few bytes to check, it's not worth running the prefilter since
+    // there will need to be a validation step anyway. Thus, the prefilter is
+    // largely redundant work.
+    //
+    // Increasing the factor noticeably hurts the
+    // memmem/krate/prebuilt/teeny-*/never-john-watson benchmarks.
+    const PREFILTER_LENGTH_FACTOR: usize = 2;
+    const VECTOR_MIN_LENGTH: usize = 16;
+    let min = core::cmp::max(
+        VECTOR_MIN_LENGTH,
+        PREFILTER_LENGTH_FACTOR * needle.len(),
+    );
+    // For haystacks with length==min, we still want to avoid the prefilter,
+    // so add 1.
+    min + 1
+}
+
+#[cfg(all(test, feature = "std", not(miri)))]
+pub(crate) mod tests {
+    use std::convert::{TryFrom, TryInto};
+
+    use super::*;
+    use crate::memmem::{
+        prefilter::PrefilterFnTy, rabinkarp, rarebytes::RareNeedleBytes,
+    };
+
+    // Below is a small jig that generates prefilter tests. The main purpose
+    // of this jig is to generate tests of varying needle/haystack lengths
+    // in order to try and exercise all code paths in our prefilters. And in
+    // particular, this is especially important for vectorized prefilters where
+    // certain code paths might only be exercised at certain lengths.
+
+    /// A test that represents the input and expected output to a prefilter
+    /// function. The test should be able to run with any prefilter function
+    /// and get the expected output.
+    pub(crate) struct PrefilterTest {
+        // These fields represent the inputs and expected output of a forwards
+        // prefilter function.
+        pub(crate) ninfo: NeedleInfo,
+        pub(crate) haystack: Vec<u8>,
+        pub(crate) needle: Vec<u8>,
+        pub(crate) output: Option<usize>,
+    }
+
+    impl PrefilterTest {
+        /// Run all generated forward prefilter tests on the given prefn.
+        ///
+        /// # Safety
+        ///
+        /// Callers must ensure that the given prefilter function pointer is
+        /// safe to call for all inputs in the current environment.
+        pub(crate) unsafe fn run_all_tests(prefn: PrefilterFnTy) {
+            PrefilterTest::run_all_tests_filter(prefn, |_| true)
+        }
+
+        /// Run all generated forward prefilter tests that pass the given
+        /// predicate on the given prefn.
+        ///
+        /// # Safety
+        ///
+        /// Callers must ensure that the given prefilter function pointer is
+        /// safe to call for all inputs in the current environment.
+        pub(crate) unsafe fn run_all_tests_filter(
+            prefn: PrefilterFnTy,
+            mut predicate: impl FnMut(&PrefilterTest) -> bool,
+        ) {
+            for seed in PREFILTER_TEST_SEEDS {
+                for test in seed.generate() {
+                    if predicate(&test) {
+                        test.run(prefn);
+                    }
+                }
+            }
+        }
+
+        /// Create a new prefilter test from a seed and some chosen offsets to
+        /// rare bytes in the seed's needle.
+        ///
+        /// If a valid test could not be constructed, then None is returned.
+        /// (Currently, we take the approach of massaging tests to be valid
+        /// instead of rejecting them outright.)
+        fn new(
+            seed: &PrefilterTestSeed,
+            rare1i: usize,
+            rare2i: usize,
+            haystack_len: usize,
+            needle_len: usize,
+            output: Option<usize>,
+        ) -> Option<PrefilterTest> {
+            let mut rare1i: u8 = rare1i.try_into().unwrap();
+            let mut rare2i: u8 = rare2i.try_into().unwrap();
+            // The '#' byte is never used in a haystack (unless we're expecting
+            // a match), while the '@' byte is never used in a needle.
+            let mut haystack = vec![b'@'; haystack_len];
+            let mut needle = vec![b'#'; needle_len];
+            needle[0] = seed.first;
+            needle[rare1i as usize] = seed.rare1;
+            needle[rare2i as usize] = seed.rare2;
+            // If we're expecting a match, then make sure the needle occurs
+            // in the haystack at the expected position.
+            if let Some(i) = output {
+                haystack[i..i + needle.len()].copy_from_slice(&needle);
+            }
+            // If the operations above lead to rare offsets pointing to the
+            // non-first occurrence of a byte, then adjust it. This might lead
+            // to redundant tests, but it's simpler than trying to change the
+            // generation process I think.
+            if let Some(i) = crate::memchr(seed.rare1, &needle) {
+                rare1i = u8::try_from(i).unwrap();
+            }
+            if let Some(i) = crate::memchr(seed.rare2, &needle) {
+                rare2i = u8::try_from(i).unwrap();
+            }
+            let ninfo = NeedleInfo {
+                rarebytes: RareNeedleBytes::new(rare1i, rare2i),
+                nhash: rabinkarp::NeedleHash::forward(&needle),
+            };
+            Some(PrefilterTest { ninfo, haystack, needle, output })
+        }
+
+        /// Run this specific test on the given prefilter function. If the
+        /// outputs do not match, then this routine panics with a failure
+        /// message.
+        ///
+        /// # Safety
+        ///
+        /// Callers must ensure that the given prefilter function pointer is
+        /// safe to call for all inputs in the current environment.
+        unsafe fn run(&self, prefn: PrefilterFnTy) {
+            let mut prestate = PrefilterState::new();
+            assert_eq!(
+                self.output,
+                prefn(
+                    &mut prestate,
+                    &self.ninfo,
+                    &self.haystack,
+                    &self.needle
+                ),
+                "ninfo: {:?}, haystack(len={}): {:?}, needle(len={}): {:?}",
+                self.ninfo,
+                self.haystack.len(),
+                std::str::from_utf8(&self.haystack).unwrap(),
+                self.needle.len(),
+                std::str::from_utf8(&self.needle).unwrap(),
+            );
+        }
+    }
+
+    /// A set of prefilter test seeds. Each seed serves as the base for the
+    /// generation of many other tests. In essence, the seed captures the
+    /// "rare" and first bytes among our needle. The tests generated from each
+    /// seed essentially vary the length of the needle and haystack, while
+    /// using the rare/first byte configuration from the seed.
+    ///
+    /// The purpose of this is to test many different needle/haystack lengths.
+    /// In particular, some of the vector optimizations might only have bugs
+    /// in haystacks of a certain size.
+    const PREFILTER_TEST_SEEDS: &[PrefilterTestSeed] = &[
+        PrefilterTestSeed { first: b'x', rare1: b'y', rare2: b'z' },
+        PrefilterTestSeed { first: b'x', rare1: b'x', rare2: b'z' },
+        PrefilterTestSeed { first: b'x', rare1: b'y', rare2: b'x' },
+        PrefilterTestSeed { first: b'x', rare1: b'x', rare2: b'x' },
+        PrefilterTestSeed { first: b'x', rare1: b'y', rare2: b'y' },
+    ];
+
+    /// Data that describes a single prefilter test seed.
+    struct PrefilterTestSeed {
+        first: u8,
+        rare1: u8,
+        rare2: u8,
+    }
+
+    impl PrefilterTestSeed {
+        /// Generate a series of prefilter tests from this seed.
+        fn generate(&self) -> Vec<PrefilterTest> {
+            let mut tests = vec![];
+            let mut push = |test: Option<PrefilterTest>| {
+                if let Some(test) = test {
+                    tests.push(test);
+                }
+            };
+            let len_start = 2;
+            // The loop below generates *a lot* of tests. The number of tests
+            // was chosen somewhat empirically to be "bearable" when running
+            // the test suite.
+            for needle_len in len_start..=40 {
+                let rare_start = len_start - 1;
+                for rare1i in rare_start..needle_len {
+                    for rare2i in rare1i..needle_len {
+                        for haystack_len in needle_len..=66 {
+                            push(PrefilterTest::new(
+                                self,
+                                rare1i,
+                                rare2i,
+                                haystack_len,
+                                needle_len,
+                                None,
+                            ));
+                            // Test all possible match scenarios for this
+                            // needle and haystack.
+                            for output in 0..=(haystack_len - needle_len) {
+                                push(PrefilterTest::new(
+                                    self,
+                                    rare1i,
+                                    rare2i,
+                                    haystack_len,
+                                    needle_len,
+                                    Some(output),
+                                ));
+                            }
+                        }
+                    }
+                }
+            }
+            tests
+        }
+    }
+}
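The effectiveness heuristic in `PrefilterState` can be read in isolation. The following is a minimal standalone sketch of it, not the code from this commit: `SkipTracker` is a hypothetical name, the two thresholds are copied from the constants shown above, and `saturating_mul` is an assumption added here to keep the sketch panic-free.

```rust
/// Hypothetical stand-in for `PrefilterState`, written only to illustrate
/// the heuristic. The name `SkipTracker` does not appear in the commit.
#[derive(Debug)]
struct SkipTracker {
    /// One more than the number of prefilter calls; 0 means "inert".
    skips: u32,
    /// Total number of bytes skipped across all prefilter calls.
    skipped: u32,
}

impl SkipTracker {
    /// Number of calls to observe before judging the prefilter.
    const MIN_SKIPS: u32 = 50;
    /// Minimum average number of bytes each call must skip.
    const MIN_SKIP_BYTES: u32 = 8;

    fn new() -> SkipTracker {
        SkipTracker { skips: 1, skipped: 0 }
    }

    /// Record that the last prefilter call skipped `skipped` bytes.
    fn update(&mut self, skipped: usize) {
        self.skips = self.skips.saturating_add(1);
        // Clamp to u32 so that very large skips saturate instead of wrapping.
        let clamped = std::cmp::min(skipped, u32::MAX as usize) as u32;
        self.skipped = self.skipped.saturating_add(clamped);
    }

    /// Report whether the prefilter still pays for itself. Once this
    /// returns false, it returns false forever.
    fn is_effective(&mut self) -> bool {
        if self.skips == 0 {
            return false;
        }
        let calls = self.skips - 1;
        if calls < Self::MIN_SKIPS {
            // Too few samples to judge; keep using the prefilter.
            return true;
        }
        // Assumption: saturating_mul is used here to avoid overflow.
        if self.skipped >= Self::MIN_SKIP_BYTES.saturating_mul(calls) {
            return true;
        }
        // The average skip is too small: flip to the inert sentinel.
        self.skips = 0;
        false
    }
}
```

Under these assumptions, a tracker fed 50 updates of 2 bytes each goes inert (100 < 8 * 50), while one fed 50 updates of 64 bytes each stays effective.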
@@ -0,0 +1,46 @@
+use core::arch::x86_64::__m256i;
+
+use crate::memmem::{
+    prefilter::{PrefilterFnTy, PrefilterState},
+    NeedleInfo,
+};
+
+// Check that the functions below satisfy the Prefilter function type.
+const _: PrefilterFnTy = find;
+
+/// An AVX2 accelerated candidate finder for single-substring search.
+///
+/// # Safety
+///
+/// Callers must ensure that the avx2 CPU feature is enabled in the current
+/// environment.
+#[target_feature(enable = "avx2")]
+pub(crate) unsafe fn find(
+    prestate: &mut PrefilterState,
+    ninfo: &NeedleInfo,
+    haystack: &[u8],
+    needle: &[u8],
+) -> Option<usize> {
+    super::super::genericsimd::find::<__m256i>(
+        prestate,
+        ninfo,
+        haystack,
+        needle,
+        super::sse::find,
+    )
+}
+
+#[cfg(test)]
+mod tests {
+    #[test]
+    #[cfg(not(miri))]
+    fn prefilter_permutations() {
+        use crate::memmem::prefilter::tests::PrefilterTest;
+        if !is_x86_feature_detected!("avx2") {
+            return;
+        }
+        // SAFETY: The safety of super::find only requires that the current
+        // CPU support AVX2, which we checked above.
+        unsafe { PrefilterTest::run_all_tests(super::find) };
+    }
+}
@@ -0,0 +1,5 @@
+// We only use AVX when we can detect at runtime whether it's available, which
+// requires std.
+#[cfg(feature = "std")]
+pub(crate) mod avx;
+pub(crate) mod sse;
@@ -0,0 +1,55 @@
+use core::arch::x86_64::__m128i;
+
+use crate::memmem::{
+    prefilter::{PrefilterFnTy, PrefilterState},
+    NeedleInfo,
+};
+
+// Check that the functions below satisfy the Prefilter function type.
+const _: PrefilterFnTy = find;
+
+/// An SSE2 accelerated candidate finder for single-substring search.
+///
+/// # Safety
+///
+/// Callers must ensure that the sse2 CPU feature is enabled in the current
+/// environment. This feature should be enabled in all x86_64 targets.
+#[target_feature(enable = "sse2")]
+pub(crate) unsafe fn find(
+    prestate: &mut PrefilterState,
+    ninfo: &NeedleInfo,
+    haystack: &[u8],
+    needle: &[u8],
+) -> Option<usize> {
+    // If the haystack is too small for SSE2, then just run memchr on the
+    // rarest byte and be done with it. (It is likely that this code path is
+    // rarely exercised, since a higher level routine will probably dispatch to
+    // Rabin-Karp for such a small haystack.)
+    fn simple_memchr_fallback(
+        _prestate: &mut PrefilterState,
+        ninfo: &NeedleInfo,
+        haystack: &[u8],
+        needle: &[u8],
+    ) -> Option<usize> {
+        let (rare, _) = ninfo.rarebytes.as_rare_ordered_usize();
+        crate::memchr(needle[rare], haystack).map(|i| i.saturating_sub(rare))
+    }
+    super::super::genericsimd::find::<__m128i>(
+        prestate,
+        ninfo,
+        haystack,
+        needle,
+        simple_memchr_fallback,
+    )
+}
+
+#[cfg(all(test, feature = "std"))]
+mod tests {
+    #[test]
+    #[cfg(not(miri))]
+    fn prefilter_permutations() {
+        use crate::memmem::prefilter::tests::PrefilterTest;
+        // SAFETY: super::find only requires SSE2, which is always available
+        // on x86_64.
+        unsafe { PrefilterTest::run_all_tests(super::find) };
+    }
+}
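The prefilter contract (false positives are fine, false negatives are not) and the `simple_memchr_fallback` idea can be sketched without any SIMD. The example below is an illustration under assumptions, not the crate's code: `candidate` and `find_with_candidates` are hypothetical names, the rare byte offset is passed in directly instead of coming from `NeedleInfo`, and `Iterator::position` stands in for `crate::memchr`.

```rust
/// Hypothetical candidate finder: locate the first occurrence of the needle's
/// rare byte and translate it back to a possible match start.
///
/// Assumes `needle` is non-empty and `rare < needle.len()`.
fn candidate(haystack: &[u8], needle: &[u8], rare: usize) -> Option<usize> {
    haystack
        .iter()
        .position(|&b| b == needle[rare])
        // The rare byte may sit closer to the start than `rare`, so clamp
        // at 0. This can yield a false positive but never skips a match.
        .map(|i| i.saturating_sub(rare))
}

/// Hypothetical driver: ask for candidates and confirm each one.
fn find_with_candidates(
    haystack: &[u8],
    needle: &[u8],
    rare: usize,
) -> Option<usize> {
    let mut start = 0;
    while start + needle.len() <= haystack.len() {
        // No rare byte in the remainder means no match is possible.
        let cand = start + candidate(&haystack[start..], needle, rare)?;
        if haystack[cand..].starts_with(needle) {
            return Some(cand);
        }
        // False positive: resume just past the rejected candidate.
        start = cand + 1;
    }
    None
}
```

The candidate never lands past a real match because any match at position `p` places the rare byte at `p + rare`, so the first rare byte found is at or before that position.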
@@ -0,0 +1,233 @@
+/*
+This module implements the classical Rabin-Karp substring search algorithm,
+with no extra frills. While its use would seem to break our time complexity
+guarantee of O(m+n) (RK's time complexity is O(mn)), we are careful to only
+ever use RK on haystacks of bounded length. The main point here is that
+RK has good latency properties for small needles/haystacks. It's very quick
+to compute a needle hash and zip through the haystack when compared to
+initializing Two-Way, for example. And this is especially useful for cases
+where the haystack is just too short for vector instructions to do much good.
+
+The hashing function used here is the same one recommended by ESMAJ.
+
+Another choice instead of Rabin-Karp would be Shift-Or. But its latency
+isn't quite as good since its preprocessing time is a bit more expensive
+(both in practice and in theory). However, perhaps Shift-Or has a place
+somewhere else for short patterns. I think the main problem is that it
+requires space proportional to the alphabet and the needle. If we, for
+example, supported needles up to length 16, then the total table size would be
+len(alphabet)*size_of::<u16>()==512 bytes. Which isn't exactly small, and it's
+probably bad to put that on the stack. So ideally, we'd throw it on the heap,
+but we'd really like to write as much code without using alloc/std as possible.
+But maybe it's worth the special casing. It's a TODO to benchmark.
+
+Wikipedia has a decent explanation, if a bit heavy on the theory:
+https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
+
+But ESMAJ provides something a bit more concrete:
+http://www-igm.univ-mlv.fr/~lecroq/string/node5.html
+
+Finally, aho-corasick uses Rabin-Karp for multiple pattern match in some cases:
+https://github.com/BurntSushi/aho-corasick/blob/3852632f10587db0ff72ef29e88d58bf305a0946/src/packed/rabinkarp.rs
+*/
+
+/// Whether RK is believed to be very fast for the given needle/haystack.
+pub(crate) fn is_fast(haystack: &[u8], _needle: &[u8]) -> bool {
+    haystack.len() < 16
+}
+
+/// Search for the first occurrence of needle in haystack using Rabin-Karp.
+pub(crate) fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
+    find_with(&NeedleHash::forward(needle), haystack, needle)
+}
+
+/// Search for the first occurrence of needle in haystack using Rabin-Karp with
+/// a pre-computed needle hash.
+pub(crate) fn find_with(
+    nhash: &NeedleHash,
+    mut haystack: &[u8],
+    needle: &[u8],
+) -> Option<usize> {
+    if haystack.len() < needle.len() {
+        return None;
+    }
+    let start = haystack.as_ptr() as usize;
+    let mut hash = Hash::from_bytes_fwd(&haystack[..needle.len()]);
+    // N.B. I've experimented with unrolling this loop, but couldn't realize
+    // any obvious gains.
+    loop {
+        if nhash.eq(hash) && is_prefix(haystack, needle) {
+            return Some(haystack.as_ptr() as usize - start);
+        }
+        if needle.len() >= haystack.len() {
+            return None;
+        }
+        hash.roll(&nhash, haystack[0], haystack[needle.len()]);
+        haystack = &haystack[1..];
+    }
+}
+
+/// Search for the last occurrence of needle in haystack using Rabin-Karp.
+pub(crate) fn rfind(haystack: &[u8], needle: &[u8]) -> Option<usize> {
+    rfind_with(&NeedleHash::reverse(needle), haystack, needle)
+}
+
+/// Search for the last occurrence of needle in haystack using Rabin-Karp with
+/// a pre-computed needle hash.
+pub(crate) fn rfind_with(
+    nhash: &NeedleHash,
+    mut haystack: &[u8],
+    needle: &[u8],
+) -> Option<usize> {
+    if haystack.len() < needle.len() {
+        return None;
+    }
+    let mut hash =
+        Hash::from_bytes_rev(&haystack[haystack.len() - needle.len()..]);
+    loop {
+        if nhash.eq(hash) && is_suffix(haystack, needle) {
+            return Some(haystack.len() - needle.len());
+        }
+        if needle.len() >= haystack.len() {
+            return None;
+        }
+        hash.roll(
+            &nhash,
+            haystack[haystack.len() - 1],
+            haystack[haystack.len() - needle.len() - 1],
+        );
+        haystack = &haystack[..haystack.len() - 1];
+    }
+}
+
+/// A hash derived from a needle.
+#[derive(Clone, Copy, Debug, Default)]
+pub(crate) struct NeedleHash {
+    /// The actual hash.
+    hash: Hash,
+    /// The factor needed to multiply a byte by in order to subtract it from
+    /// the hash. It is defined to be 2^(n-1) (using wrapping exponentiation),
+    /// where n is the length of the needle. This is how we "remove" a byte
+    /// from the hash once the hash window rolls past it.
+    hash_2pow: u32,
+}
+
+impl NeedleHash {
+    /// Create a new Rabin-Karp hash for the given needle for use in forward
+    /// searching.
+    pub(crate) fn forward(needle: &[u8]) -> NeedleHash {
+        let mut nh = NeedleHash { hash: Hash::new(), hash_2pow: 1 };
+        if needle.is_empty() {
+            return nh;
+        }
+        nh.hash.add(needle[0]);
+        for &b in needle.iter().skip(1) {
+            nh.hash.add(b);
+            nh.hash_2pow = nh.hash_2pow.wrapping_shl(1);
+        }
+        nh
+    }
+
+    /// Create a new Rabin-Karp hash for the given needle for use in reverse
+    /// searching.
+    pub(crate) fn reverse(needle: &[u8]) -> NeedleHash {
+        let mut nh = NeedleHash { hash: Hash::new(), hash_2pow: 1 };
+        if needle.is_empty() {
+            return nh;
+        }
+        nh.hash.add(needle[needle.len() - 1]);
+        for &b in needle.iter().rev().skip(1) {
+            nh.hash.add(b);
+            nh.hash_2pow = nh.hash_2pow.wrapping_shl(1);
+        }
+        nh
+    }
+
+    /// Return true if the hashes are equivalent.
+    fn eq(&self, hash: Hash) -> bool {
+        self.hash == hash
+    }
+}
+
+/// A Rabin-Karp hash. This might represent the hash of a needle, or the hash
+/// of a rolling window in the haystack.
+#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
+pub(crate) struct Hash(u32);
+
+impl Hash {
+    /// Create a new hash that represents the empty string.
+    pub(crate) fn new() -> Hash {
+        Hash(0)
+    }
+
+    /// Create a new hash from the bytes given for use in forward searches.
+    pub(crate) fn from_bytes_fwd(bytes: &[u8]) -> Hash {
+        let mut hash = Hash::new();
+        for &b in bytes {
+            hash.add(b);
+        }
+        hash
+    }
+
+    /// Create a new hash from the bytes given for use in reverse searches.
+    fn from_bytes_rev(bytes: &[u8]) -> Hash {
+        let mut hash = Hash::new();
+        for &b in bytes.iter().rev() {
+            hash.add(b);
+        }
+        hash
+    }
+
+    /// Add 'new' and remove 'old' from this hash. The given needle hash should
+    /// correspond to the hash computed for the needle being searched for.
+    ///
+    /// This is meant to be used when the rolling window of the haystack is
+    /// advanced.
+    fn roll(&mut self, nhash: &NeedleHash, old: u8, new: u8) {
+        self.del(nhash, old);
+        self.add(new);
+    }
+
+    /// Add a byte to this hash.
+    fn add(&mut self, byte: u8) {
+        self.0 = self.0.wrapping_shl(1).wrapping_add(byte as u32);
+    }
+
+    /// Remove a byte from this hash. The given needle hash should correspond
+    /// to the hash computed for the needle being searched for.
+    fn del(&mut self, nhash: &NeedleHash, byte: u8) {
+        let factor = nhash.hash_2pow;
+        self.0 = self.0.wrapping_sub((byte as u32).wrapping_mul(factor));
+    }
+}
+
+/// Returns true if the given needle is a prefix of the given haystack.
+///
+/// We forcefully don't inline the is_prefix call and hint to the compiler that
+/// it is unlikely to be called. This causes the inner rabinkarp loop above
+/// to be a bit tighter and leads to some performance improvement. See the
+/// memmem/krate/prebuilt/sliceslice-words/words benchmark.
+#[cold]
+#[inline(never)]
+fn is_prefix(haystack: &[u8], needle: &[u8]) -> bool {
+    crate::memmem::util::is_prefix(haystack, needle)
+}
+
+/// Returns true if the given needle is a suffix of the given haystack.
+///
+/// See is_prefix for why this is forcefully not inlined.
+#[cold]
+#[inline(never)]
+fn is_suffix(haystack: &[u8], needle: &[u8]) -> bool {
+    crate::memmem::util::is_suffix(haystack, needle)
+}
+
+#[cfg(test)]
+mod simpletests {
+    define_memmem_simple_tests!(super::find, super::rfind);
+}
+
+#[cfg(all(test, feature = "std", not(miri)))]
+mod proptests {
+    define_memmem_quickcheck_tests!(super::find, super::rfind);
+}
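The rolling hash implemented by `add`, `del` and `roll` above can be exercised in isolation. The following is a minimal sketch, not the crate's actual search routine: `hash_of` and `rk_find` are hypothetical names, and `hash_2pow` is assumed to be 2^(n - 1) modulo 2^32 for a needle of length n, since the `NeedleHash` definition is not shown in this part of the diff.

```rust
/// Hash a window of bytes the same way `Hash::add` does:
/// double the state, then add the byte (all arithmetic wraps).
fn hash_of(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .fold(0u32, |h, &b| h.wrapping_shl(1).wrapping_add(b as u32))
}

/// Hypothetical Rabin-Karp search built on the rolling update.
/// Returns the start of the first match, if any.
fn rk_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let n = needle.len();
    if n == 0 {
        return Some(0);
    }
    if haystack.len() < n {
        return None;
    }
    // Assumption: this plays the role of `hash_2pow`, i.e. the weight of
    // the oldest byte in the window, 2^(n - 1) modulo 2^32.
    let mut hash_2pow = 1u32;
    for _ in 1..n {
        hash_2pow = hash_2pow.wrapping_shl(1);
    }
    let nhash = hash_of(needle);
    let mut hash = hash_of(&haystack[..n]);
    let mut start = 0;
    loop {
        // A hash match is only a candidate; confirm it byte by byte.
        if hash == nhash && &haystack[start..start + n] == needle {
            return Some(start);
        }
        if start + n >= haystack.len() {
            return None;
        }
        let (old, new) = (haystack[start], haystack[start + n]);
        // `del`: remove the contribution of the byte leaving the window.
        hash = hash.wrapping_sub((old as u32).wrapping_mul(hash_2pow));
        // `add`: shift the window and bring in the new byte.
        hash = hash.wrapping_shl(1).wrapping_add(new as u32);
        start += 1;
    }
}
```

Because the multiplier is 2 and the state is a `u32`, bytes more than 32 positions old are shifted out of the hash entirely, so their weight is zero and removing them is a no-op. The explicit comparison on a hash match is what keeps the result correct despite collisions.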
@@ -0,0 +1,136 @@
+/// A heuristic frequency based detection of rare bytes for substring search.
+///
+/// This detector attempts to pick out two bytes in a needle that are predicted
+/// to occur least frequently. The purpose is to use these bytes to implement
+/// fast candidate search using vectorized code.
+///
+/// A set of offsets is only computed for needles of length 2 or greater.
+/// Smaller needles should be special cased by the substring search algorithm
+/// in use. (e.g., Use memchr for single byte needles.)
+///
+/// Note that we use `u8` to represent the offsets of the rare bytes in a
+/// needle to reduce space usage. This means that a rare byte occurring after
+/// the first 255 bytes in a needle will never be used.
+#[derive(Clone, Copy, Debug, Default)]
+pub(crate) struct RareNeedleBytes {
+    /// The leftmost offset of the rarest byte in the needle, according to
+    /// pre-computed frequency analysis. The "leftmost offset" means that
+    /// rare1i <= i for all i where needle[i] == needle[rare1i].
+    rare1i: u8,
+    /// The leftmost offset of the second rarest byte in the needle, according
+    /// to pre-computed frequency analysis. The "leftmost offset" means that
+    /// rare2i <= i for all i where needle[i] == needle[rare2i].
+    ///
+    /// The second rarest byte is used as a type of guard for quickly detecting
+    /// a mismatch if the first byte matches. This is a hedge against
+    /// pathological cases where the pre-computed frequency analysis may be
+    /// off. (But of course, does not prevent *all* pathological cases.)
+    ///
+    /// In general, rare1i != rare2i by construction, although there is no hard
+    /// requirement that they be different. However, since the case of a single
+    /// byte needle is handled specially by memchr itself, rare2i should
+    /// generally be different from rare1i since it would otherwise be
+    /// ineffective as a guard.
+    rare2i: u8,
+}
+
+impl RareNeedleBytes {
+    /// Create a new pair of rare needle bytes with the given offsets. This is
+    /// only used in tests for generating input data.
+    #[cfg(all(test, feature = "std"))]
+    pub(crate) fn new(rare1i: u8, rare2i: u8) -> RareNeedleBytes {
+        RareNeedleBytes { rare1i, rare2i }
+    }
+
+    /// Detect the leftmost offsets of the two rarest bytes in the given
+    /// needle.
+    pub(crate) fn forward(needle: &[u8]) -> RareNeedleBytes {
+        if needle.len() <= 1 || needle.len() > core::u8::MAX as usize {
+            // For needles bigger than u8::MAX, our offsets aren't big enough.
+            // (We make our offsets small to reduce stack copying.)
+            // If you have a use case for it, please file an issue. In that
+            // case, we should probably just adjust the routine below to pick
+            // some rare bytes from the first 255 bytes of the needle.
+            //
+            // Also note that for needles of size 0 or 1, they are special
+            // cased in Two-Way.
+            //
+            // TODO: Benchmark this.
+            return RareNeedleBytes { rare1i: 0, rare2i: 0 };
+        }
+
+        // Find the rarest two bytes. We make them distinct by construction.
+        let (mut rare1, mut rare1i) = (needle[0], 0);
+        let (mut rare2, mut rare2i) = (needle[1], 1);
+        if rank(rare2) < rank(rare1) {
+            core::mem::swap(&mut rare1, &mut rare2);
+            core::mem::swap(&mut rare1i, &mut rare2i);
+        }
+        for (i, &b) in needle.iter().enumerate().skip(2) {
+            if rank(b) < rank(rare1) {
+                rare2 = rare1;
+                rare2i = rare1i;
+                rare1 = b;
+                rare1i = i as u8;
+            } else if b != rare1 && rank(b) < rank(rare2) {
+                rare2 = b;
+                rare2i = i as u8;
+            }
+        }
+        // While not strictly required, we really don't want these to be
+        // equivalent. If they were, it would reduce the effectiveness of
+        // candidate searching using these rare bytes by increasing the rate of
+        // false positives.
+        assert_ne!(rare1i, rare2i);
+        RareNeedleBytes { rare1i, rare2i }
+    }
+
+    /// Return the rare bytes in the given needle in the forward direction.
+    /// The needle given must be the same one given to the RareNeedleBytes
+    /// constructor.
+    pub(crate) fn as_rare_bytes(&self, needle: &[u8]) -> (u8, u8) {
+        (needle[self.rare1i as usize], needle[self.rare2i as usize])
+    }
+
+    /// Return the rare offsets such that the first offset is always <= the
+    /// second offset. This is useful when the caller doesn't care whether
+    /// rare1 is rarer than rare2, but just wants to ensure that they are
+    /// ordered with respect to one another.
+    #[cfg(memchr_runtime_simd)]
+    pub(crate) fn as_rare_ordered_usize(&self) -> (usize, usize) {
+        let (rare1i, rare2i) = self.as_rare_ordered_u8();
+        (rare1i as usize, rare2i as usize)
+    }
+
+    /// Like as_rare_ordered_usize, but returns the offsets as their native
+    /// u8 values.
+    #[cfg(memchr_runtime_simd)]
+    pub(crate) fn as_rare_ordered_u8(&self) -> (u8, u8) {
+        if self.rare1i <= self.rare2i {
+            (self.rare1i, self.rare2i)
+        } else {
+            (self.rare2i, self.rare1i)
+        }
+    }
+
+    /// Return the rare offsets as usize values in the order in which they were
+    /// constructed. rare1, for example, is constructed as the "rarer" byte,
+    /// and thus, callers may want to treat it differently from rare2.
+    pub(crate) fn as_rare_usize(&self) -> (usize, usize) {
+        (self.rare1i as usize, self.rare2i as usize)
+    }
+
+    /// Return the byte frequency rank of each byte. The higher the rank, the
+    /// more frequent the byte is predicted to be. The needle given must be
+    /// the same one given to the RareNeedleBytes constructor.
+    pub(crate) fn as_ranks(&self, needle: &[u8]) -> (usize, usize) {
+        let (b1, b2) = self.as_rare_bytes(needle);
+        (rank(b1), rank(b2))
+    }
+}
+
+/// Return the heuristic frequency rank of the given byte. A lower rank
+/// means the byte is believed to occur less frequently.
+fn rank(b: u8) -> usize {
+    crate::memmem::byte_frequencies::BYTE_FREQUENCIES[b as usize] as usize
+}
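The selection loop in `RareNeedleBytes::forward` can be tried out without the crate's frequency table. This is a sketch under stated assumptions: `toy_rank` is a hypothetical stand-in for `rank` with made-up numbers (the real `BYTE_FREQUENCIES` table is not part of this section), and `rare_offsets` is a hypothetical name that returns `usize` offsets instead of `u8`.

```rust
/// Hypothetical stand-in for the crate's frequency table. A lower rank
/// means the byte is assumed to be rarer. The numbers are made up.
fn toy_rank(b: u8) -> usize {
    match b {
        b' ' | b'e' | b't' | b'a' | b'o' => 250,
        b'a'..=b'z' => 200,
        b'A'..=b'Z' | b'0'..=b'9' => 150,
        _ => 50,
    }
}

/// Pick the offsets of the two bytes predicted to be rarest, rarest first.
/// Returns None for needles shorter than two bytes, which callers are
/// expected to special case.
fn rare_offsets(needle: &[u8]) -> Option<(usize, usize)> {
    if needle.len() < 2 {
        return None;
    }
    // Start with the first two positions, ordered so that r1 is rarer.
    let (mut r1, mut r2) = (0, 1);
    if toy_rank(needle[r2]) < toy_rank(needle[r1]) {
        std::mem::swap(&mut r1, &mut r2);
    }
    for (i, &b) in needle.iter().enumerate().skip(2) {
        if toy_rank(b) < toy_rank(needle[r1]) {
            // New rarest byte: the old rarest becomes the second rarest.
            r2 = r1;
            r1 = i;
        } else if b != needle[r1] && toy_rank(b) < toy_rank(needle[r2]) {
            r2 = i;
        }
    }
    Some((r1, r2))
}
```

With this toy ranking, `rare_offsets(b"hello, world")` yields `Some((5, 0))`: the comma is treated as the rarest byte and the leading `h` as the guard byte. The strict `<` comparisons are what keep the leftmost offset when a byte repeats.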
@@ -0,0 +1,878 @@
+use core::cmp;
+
+use crate::memmem::{prefilter::Pre, util};
+
+/// Two-Way search in the forward direction.
+#[derive(Clone, Copy, Debug)]
+pub(crate) struct Forward(TwoWay);
+
+/// Two-Way search in the reverse direction.
+#[derive(Clone, Copy, Debug)]
+pub(crate) struct Reverse(TwoWay);
+
+/// An implementation of the Two-Way substring search algorithm, with
+/// heuristics for accelerating search based on frequency analysis.
+///
+/// This searcher supports forward and reverse search, although not
+/// simultaneously. It runs in O(n + m) time and O(1) space, where
+/// `n ~ len(needle)` and `m ~ len(haystack)`.
+///
+/// The implementation here roughly matches that which was developed by
+/// Crochemore and Perrin in their 1991 paper "Two-way string-matching." The
+/// changes in this implementation are 1) the use of zero-based indices, 2) a
+/// heuristic skip table based on the last byte (borrowed from Rust's standard
+/// library) and 3) the addition of heuristics for a fast skip loop. That is,
+/// (3) detects bytes that are believed to be rare in the needle and uses
+/// fast vectorized instructions to find their occurrences quickly. The
+/// Two-Way algorithm is then used to confirm whether a match at that location
+/// occurred.
+///
+/// The heuristic for fast skipping is automatically shut off if it's
+/// detected to be ineffective at search time. Generally, this only occurs in
+/// pathological cases. But this is generally necessary in order to preserve
+/// a `O(n + m)` time bound.
+///
+/// The code below is fairly complex and not obviously correct at all. It's
+/// likely necessary to read the Two-Way paper cited above in order to fully
+/// grok this code. The essence of it is:
+///
+/// 1) Do something to detect a "critical" position in the needle.
+/// 2) For the current position in the haystack, look if needle[critical..]
+///    matches at that position.
+/// 3) If so, look if needle[..critical] matches.
+/// 4) If a mismatch occurs, shift the search by some amount based on the
+///    critical position and a pre-computed shift.
+///
+/// This type is wrapped in Forward and Reverse types that expose consistent
+/// forward or reverse APIs.
+#[derive(Clone, Copy, Debug)]
+struct TwoWay {
+    /// A small bitset used as a quick prefilter (in addition to the faster
+    /// SIMD based prefilter). Namely, a bit 'i' is set if and only if b%64==i
+    /// for any b in the needle.
+    ///
+    /// When used as a prefilter, if the last byte at the current candidate
+    /// position is NOT in this set, then we can skip that entire candidate
+    /// position (the length of the needle). This is essentially the shift
+    /// trick found in Boyer-Moore, but only applied to bytes that don't appear
+    /// in the needle.
+    ///
+    /// N.B. This trick was inspired by something similar in std's
+    /// implementation of Two-Way.
+    byteset: ApproximateByteSet,
+    /// A critical position in needle. Specifically, this position corresponds
+    /// to the beginning of either the minimal or maximal suffix in needle.
+    /// (N.B. See SuffixKind below for why "minimal" isn't quite the correct
+    /// word here.)
+    ///
+    /// This is the position at which every search begins. Namely, search
+    /// starts by scanning text to the right of this position, and only if
+    /// there's a match does the text to the left of this position get scanned.
+    critical_pos: usize,
+    /// The amount we shift by in the Two-Way search algorithm. This
+    /// corresponds to the "small period" and "large period" cases.
+    shift: Shift,
+}
+
+impl Forward {
+    /// Create a searcher that uses the Two-Way algorithm by searching forwards
+    /// through any haystack.
+    pub(crate) fn new(needle: &[u8]) -> Forward {
+        if needle.is_empty() {
+            return Forward(TwoWay::empty());
+        }
+
+        let byteset = ApproximateByteSet::new(needle);
+        let min_suffix = Suffix::forward(needle, SuffixKind::Minimal);
+        let max_suffix = Suffix::forward(needle, SuffixKind::Maximal);
+        let (period_lower_bound, critical_pos) =
+            if min_suffix.pos > max_suffix.pos {
+                (min_suffix.period, min_suffix.pos)
+            } else {
+                (max_suffix.period, max_suffix.pos)
+            };
+        let shift = Shift::forward(needle, period_lower_bound, critical_pos);
+        Forward(TwoWay { byteset, critical_pos, shift })
+    }
+
+    /// Find the position of the first occurrence of this searcher's needle in
+    /// the given haystack. If one does not exist, then return None.
+    ///
+    /// This accepts prefilter state that is useful when using the same
+    /// searcher multiple times, such as in an iterator.
+    ///
+    /// Callers must guarantee that the needle is non-empty and its length is
+    /// <= the haystack's length.
+    #[inline(always)]
+    pub(crate) fn find(
+        &self,
+        pre: Option<&mut Pre<'_>>,
+        haystack: &[u8],
+        needle: &[u8],
+    ) -> Option<usize> {
+        debug_assert!(!needle.is_empty(), "needle should not be empty");
+        debug_assert!(needle.len() <= haystack.len(), "haystack too short");
+
+        match self.0.shift {
+            Shift::Small { period } => {
+                self.find_small_imp(pre, haystack, needle, period)
+            }
+            Shift::Large { shift } => {
+                self.find_large_imp(pre, haystack, needle, shift)
+            }
+        }
+    }
+
+    /// Like find, but handles the degenerate substring test cases. This is
+    /// only useful for conveniently testing this substring implementation in
+    /// isolation.
+    #[cfg(test)]
+    fn find_general(
+        &self,
+        pre: Option<&mut Pre<'_>>,
+        haystack: &[u8],
+        needle: &[u8],
+    ) -> Option<usize> {
+        if needle.is_empty() {
+            Some(0)
+        } else if haystack.len() < needle.len() {
+            None
+        } else {
+            self.find(pre, haystack, needle)
+        }
+    }
+
+    // Each of the two search implementations below can be accelerated by a
+    // prefilter, but it is not always enabled. To avoid its overhead when
+    // it's disabled, we explicitly inline each search implementation based on
+    // whether a prefilter will be used or not. The decision on which to use
+    // is made in the parent meta searcher.
+
+    #[inline(always)]
+    fn find_small_imp(
+        &self,
+        mut pre: Option<&mut Pre<'_>>,
+        haystack: &[u8],
+        needle: &[u8],
+        period: usize,
+    ) -> Option<usize> {
+        let last_byte = needle.len() - 1;
+        let mut pos = 0;
+        let mut shift = 0;
+        while pos + needle.len() <= haystack.len() {
+            let mut i = cmp::max(self.0.critical_pos, shift);
+            if let Some(pre) = pre.as_mut() {
+                if pre.should_call() {
+                    pos += pre.call(&haystack[pos..], needle)?;
+                    shift = 0;
+                    i = self.0.critical_pos;
+                    if pos + needle.len() > haystack.len() {
+                        return None;
+                    }
+                }
+            }
+            if !self.0.byteset.contains(haystack[pos + last_byte]) {
+                pos += needle.len();
+                shift = 0;
+                continue;
+            }
+            while i < needle.len() && needle[i] == haystack[pos + i] {
+                i += 1;
+            }
+            if i < needle.len() {
+                pos += i - self.0.critical_pos + 1;
+                shift = 0;
+            } else {
+                let mut j = self.0.critical_pos;
+                while j > shift && needle[j] == haystack[pos + j] {
+                    j -= 1;
+                }
+                if j <= shift && needle[shift] == haystack[pos + shift] {
+                    return Some(pos);
+                }
+                pos += period;
+                shift = needle.len() - period;
+            }
+        }
+        None
+    }
+
+    #[inline(always)]
+    fn find_large_imp(
+        &self,
+        mut pre: Option<&mut Pre<'_>>,
+        haystack: &[u8],
+        needle: &[u8],
+        shift: usize,
+    ) -> Option<usize> {
+        let last_byte = needle.len() - 1;
+        let mut pos = 0;
+        'outer: while pos + needle.len() <= haystack.len() {
+            if let Some(pre) = pre.as_mut() {
+                if pre.should_call() {
+                    pos += pre.call(&haystack[pos..], needle)?;
+                    if pos + needle.len() > haystack.len() {
+                        return None;
+                    }
+                }
+            }
+
+            if !self.0.byteset.contains(haystack[pos + last_byte]) {
+                pos += needle.len();
+                continue;
+            }
+            let mut i = self.0.critical_pos;
+            while i < needle.len() && needle[i] == haystack[pos + i] {
+                i += 1;
+            }
+            if i < needle.len() {
+                pos += i - self.0.critical_pos + 1;
+            } else {
+                for j in (0..self.0.critical_pos).rev() {
+                    if needle[j] != haystack[pos + j] {
+                        pos += shift;
+                        continue 'outer;
+                    }
+                }
+                return Some(pos);
+            }
+        }
+        None
+    }
+}
+
+impl Reverse {
+    /// Create a searcher that uses the Two-Way algorithm by searching in
+    /// reverse through any haystack.
+    pub(crate) fn new(needle: &[u8]) -> Reverse {
+        if needle.is_empty() {
+            return Reverse(TwoWay::empty());
+        }
+
+        let byteset = ApproximateByteSet::new(needle);
+        let min_suffix = Suffix::reverse(needle, SuffixKind::Minimal);
+        let max_suffix = Suffix::reverse(needle, SuffixKind::Maximal);
+        let (period_lower_bound, critical_pos) =
+            if min_suffix.pos < max_suffix.pos {
+                (min_suffix.period, min_suffix.pos)
+            } else {
+                (max_suffix.period, max_suffix.pos)
+            };
+        // let critical_pos = needle.len() - critical_pos;
+        let shift = Shift::reverse(needle, period_lower_bound, critical_pos);
+        Reverse(TwoWay { byteset, critical_pos, shift })
+    }
+
+    /// Find the position of the last occurrence of this searcher's needle
+    /// in the given haystack. If one does not exist, then return None.
+    ///
+    /// Unlike forward search, this does not accept or use any prefilter
+    /// state; see the comment in the body for why.
+    ///
+    /// Callers must guarantee that the needle is non-empty and its length is
+    /// <= the haystack's length.
+    #[inline(always)]
+    pub(crate) fn rfind(
+        &self,
+        haystack: &[u8],
+        needle: &[u8],
+    ) -> Option<usize> {
+        debug_assert!(!needle.is_empty(), "needle should not be empty");
+        debug_assert!(needle.len() <= haystack.len(), "haystack too short");
+        // For the reverse case, we don't use a prefilter. It's plausible that
+        // perhaps we should, but it's a lot of additional code to do it, and
+        // it's not clear that it's actually worth it. If you have a really
+        // compelling use case for this, please file an issue.
+        match self.0.shift {
+            Shift::Small { period } => {
+                self.rfind_small_imp(haystack, needle, period)
+            }
+            Shift::Large { shift } => {
+                self.rfind_large_imp(haystack, needle, shift)
+            }
+        }
+    }
+
+    /// Like rfind, but handles the degenerate substring test cases. This is
+    /// only useful for conveniently testing this substring implementation in
+    /// isolation.
+    #[cfg(test)]
+    fn rfind_general(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> {
+        if needle.is_empty() {
+            Some(haystack.len())
+        } else if haystack.len() < needle.len() {
+            None
+        } else {
+            self.rfind(haystack, needle)
+        }
+    }
+
+    #[inline(always)]
+    fn rfind_small_imp(
+        &self,
+        haystack: &[u8],
+        needle: &[u8],
+        period: usize,
+    ) -> Option<usize> {
+        let nlen = needle.len();
+        let mut pos = haystack.len();
+        let mut shift = nlen;
+        while pos >= nlen {
+            if !self.0.byteset.contains(haystack[pos - nlen]) {
+                pos -= nlen;
+                shift = nlen;
+                continue;
+            }
+            let mut i = cmp::min(self.0.critical_pos, shift);
+            while i > 0 && needle[i - 1] == haystack[pos - nlen + i - 1] {
+                i -= 1;
+            }
+            if i > 0 || needle[0] != haystack[pos - nlen] {
+                pos -= self.0.critical_pos - i + 1;
+                shift = nlen;
+            } else {
+                let mut j = self.0.critical_pos;
+                while j < shift && needle[j] == haystack[pos - nlen + j] {
+                    j += 1;
+                }
+                if j >= shift {
+                    return Some(pos - nlen);
+                }
+                pos -= period;
+                shift = period;
+            }
+        }
+        None
+    }
+
+    #[inline(always)]
+    fn rfind_large_imp(
+        &self,
+        haystack: &[u8],
+        needle: &[u8],
+        shift: usize,
+    ) -> Option<usize> {
+        let nlen = needle.len();
+        let mut pos = haystack.len();
+        while pos >= nlen {
+            if !self.0.byteset.contains(haystack[pos - nlen]) {
+                pos -= nlen;
+                continue;
+            }
+            let mut i = self.0.critical_pos;
+            while i > 0 && needle[i - 1] == haystack[pos - nlen + i - 1] {
+                i -= 1;
+            }
+            if i > 0 || needle[0] != haystack[pos - nlen] {
+                pos -= self.0.critical_pos - i + 1;
+            } else {
+                let mut j = self.0.critical_pos;
+                while j < nlen && needle[j] == haystack[pos - nlen + j] {
+                    j += 1;
+                }
+                if j == nlen {
+                    return Some(pos - nlen);
+                }
+                pos -= shift;
+            }
+        }
+        None
+    }
+}
+
+impl TwoWay {
+    fn empty() -> TwoWay {
+        TwoWay {
+            byteset: ApproximateByteSet::new(b""),
+            critical_pos: 0,
+            shift: Shift::Large { shift: 0 },
+        }
+    }
+}
+
+/// A representation of the amount we're allowed to shift by during Two-Way
+/// search.
+///
+/// When computing a critical factorization of the needle, we find the position
+/// of the critical factorization by finding the needle's maximal (or minimal)
+/// suffix, along with the period of that suffix. It turns out that the period
+/// of that suffix is a lower bound on the period of the needle itself.
+///
+/// This lower bound is equivalent to the actual period of the needle in
+/// some cases. To describe that case, we denote the needle as `x` where
+/// `x = uv` and `v` is the lexicographic maximal suffix of `x`. The lower
+/// bound given here is always the period of `v`, which is `<= period(x)`. The
+/// case where `period(v) == period(x)` occurs when `len(u) < (len(x) / 2)` and
+/// where `u` is a suffix of `v[0..period(v)]`.
+///
+/// This case is important because the search algorithm for when the
+/// periods are equivalent is slightly different than the search algorithm
+/// for when the periods are not equivalent. In particular, when they aren't
+/// equivalent, we know that the period of the needle is no less than half its
+/// length. In this case, we shift by an amount less than or equal to the
+/// period of the needle (determined by the maximum length of the components
+/// of the critical factorization of `x`, i.e., `max(len(u), len(v))`).
+///
+/// The above two cases are represented by the variants below. Each entails
+/// a different instantiation of the Two-Way search algorithm.
+///
+/// N.B. If we could find a way to compute the exact period in all cases,
+/// then we could collapse this case analysis and simplify the algorithm. The
+/// Two-Way paper suggests this is possible, but more reading is required to
+/// grok why the authors didn't pursue that path.
+#[derive(Clone, Copy, Debug)]
+enum Shift {
+    Small { period: usize },
+    Large { shift: usize },
+}
+
+impl Shift {
+    /// Compute the shift for a given needle in the forward direction.
+    ///
+    /// This requires a lower bound on the period and a critical position.
+    /// These can be computed by extracting both the minimal and maximal
+    /// lexicographic suffixes, and choosing the right-most starting position.
+    /// The lower bound on the period is then the period of the chosen suffix.
+    fn forward(
+        needle: &[u8],
+        period_lower_bound: usize,
+        critical_pos: usize,
+    ) -> Shift {
+        let large = cmp::max(critical_pos, needle.len() - critical_pos);
+        if critical_pos * 2 >= needle.len() {
+            return Shift::Large { shift: large };
+        }
+
+        let (u, v) = needle.split_at(critical_pos);
+        if !util::is_suffix(&v[..period_lower_bound], u) {
+            return Shift::Large { shift: large };
+        }
+        Shift::Small { period: period_lower_bound }
+    }
+
+    /// Compute the shift for a given needle in the reverse direction.
+    ///
+    /// This requires a lower bound on the period and a critical position.
+    /// These can be computed by extracting both the minimal and maximal
+    /// lexicographic suffixes, and choosing the left-most starting position.
+    /// The lower bound on the period is then the period of the chosen suffix.
+    fn reverse(
+        needle: &[u8],
+        period_lower_bound: usize,
+        critical_pos: usize,
+    ) -> Shift {
+        let large = cmp::max(critical_pos, needle.len() - critical_pos);
+        if (needle.len() - critical_pos) * 2 >= needle.len() {
+            return Shift::Large { shift: large };
+        }
+
+        let (v, u) = needle.split_at(critical_pos);
+        if !util::is_prefix(&v[v.len() - period_lower_bound..], u) {
+            return Shift::Large { shift: large };
+        }
+        Shift::Small { period: period_lower_bound }
+    }
+}
+
+/// A suffix extracted from a needle along with its period.
+#[derive(Debug)]
+struct Suffix {
+    /// The starting position of this suffix.
+    ///
+    /// If this is a forward suffix, then `&bytes[pos..]` can be used. If this
+    /// is a reverse suffix, then `&bytes[..pos]` can be used. That is, for
+    /// forward suffixes, this is an inclusive starting position, whereas for
+    /// reverse suffixes, this is an exclusive ending position.
+    pos: usize,
+    /// The period of this suffix.
+    ///
+    /// Note that this is NOT necessarily the period of the string from which
+    /// this suffix comes. (It is always less than or equal to the period
+    /// of the original string.)
+    period: usize,
+}
+
+impl Suffix {
+    fn forward(needle: &[u8], kind: SuffixKind) -> Suffix {
+        debug_assert!(!needle.is_empty());
+
+        // suffix represents our maximal (or minimal) suffix, along with
+        // its period.
+        let mut suffix = Suffix { pos: 0, period: 1 };
+        // The start of a suffix in `needle` that we are considering as a
+        // more maximal (or minimal) suffix than what's in `suffix`.
+        let mut candidate_start = 1;
+        // The current offset of our suffixes that we're comparing.
+        //
+        // When the characters at this offset are the same, then we mush on
+        // to the next position since no decision is possible. When the
+        // candidate's character is greater (or lesser) than the corresponding
+        // character in our current maximal (or minimal) suffix, then the
+        // current suffix is changed over to the candidate and we restart our
+        // search. Otherwise, the candidate suffix is no good and we restart
+        // our search on the next candidate.
+        //
+        // The three cases above correspond to the three cases in the loop
+        // below.
+        let mut offset = 0;
+
+        while candidate_start + offset < needle.len() {
+            let current = needle[suffix.pos + offset];
+            let candidate = needle[candidate_start + offset];
+            match kind.cmp(current, candidate) {
+                SuffixOrdering::Accept => {
+                    suffix = Suffix { pos: candidate_start, period: 1 };
+                    candidate_start += 1;
+                    offset = 0;
+                }
+                SuffixOrdering::Skip => {
+                    candidate_start += offset + 1;
+                    offset = 0;
+                    suffix.period = candidate_start - suffix.pos;
+                }
+                SuffixOrdering::Push => {
+                    if offset + 1 == suffix.period {
+                        candidate_start += suffix.period;
+                        offset = 0;
+                    } else {
+                        offset += 1;
+                    }
+                }
+            }
+        }
+        suffix
+    }
+
+    fn reverse(needle: &[u8], kind: SuffixKind) -> Suffix {
+        debug_assert!(!needle.is_empty());
+
+        // See the comments in `forward` for how this works.
+        let mut suffix = Suffix { pos: needle.len(), period: 1 };
+        if needle.len() == 1 {
+            return suffix;
+        }
+        let mut candidate_start = needle.len() - 1;
+        let mut offset = 0;
+
+        while offset < candidate_start {
+            let current = needle[suffix.pos - offset - 1];
+            let candidate = needle[candidate_start - offset - 1];
+            match kind.cmp(current, candidate) {
+                SuffixOrdering::Accept => {
+                    suffix = Suffix { pos: candidate_start, period: 1 };
+                    candidate_start -= 1;
+                    offset = 0;
+                }
+                SuffixOrdering::Skip => {
+                    candidate_start -= offset + 1;
+                    offset = 0;
+                    suffix.period = suffix.pos - candidate_start;
+                }
+                SuffixOrdering::Push => {
+                    if offset + 1 == suffix.period {
+                        candidate_start -= suffix.period;
+                        offset = 0;
+                    } else {
+                        offset += 1;
+                    }
+                }
+            }
+        }
+        suffix
+    }
+}
+
+/// The kind of suffix to extract.
+#[derive(Clone, Copy, Debug)]
+enum SuffixKind {
+    /// Extract the smallest lexicographic suffix from a string.
+    ///
+    /// Technically, this doesn't actually pick the smallest lexicographic
+    /// suffix. e.g., Given the choice between `a` and `aa`, this will choose
+    /// the latter over the former, even though `a < aa`. The reasoning for
+    /// this isn't clear from the paper, but it still smells like a minimal
+    /// suffix.
+    Minimal,
+    /// Extract the largest lexicographic suffix from a string.
+    ///
+    /// Unlike `Minimal`, this really does pick the maximum suffix. e.g., Given
+    /// the choice between `z` and `zz`, this will choose the latter over the
+    /// former.
+    Maximal,
+}
+
+/// The result of comparing corresponding bytes between two suffixes.
+#[derive(Clone, Copy, Debug)]
+enum SuffixOrdering {
+    /// This occurs when the given candidate byte indicates that the candidate
+    /// suffix is better than the current maximal (or minimal) suffix. That is,
+    /// the current candidate suffix should supplant the current maximal (or
+    /// minimal) suffix.
+    Accept,
+    /// This occurs when the given candidate byte excludes the candidate suffix
+    /// from being better than the current maximal (or minimal) suffix. That
+    /// is, the current candidate suffix should be dropped and the next one
+    /// should be considered.
+    Skip,
+    /// This occurs when no decision to accept or skip the candidate suffix
+    /// can be made, e.g., when corresponding bytes are equivalent. In this
+    /// case, the next corresponding bytes should be compared.
+    Push,
+}
+
+impl SuffixKind {
+    /// Compares the given candidate byte against the corresponding byte of
+    /// the current maximal (or minimal) suffix and returns whether the
+    /// candidate suffix should be accepted, skipped or compared further.
+    fn cmp(self, current: u8, candidate: u8) -> SuffixOrdering {
+        use self::SuffixOrdering::*;
+
+        match self {
+            SuffixKind::Minimal if candidate < current => Accept,
+            SuffixKind::Minimal if candidate > current => Skip,
+            SuffixKind::Minimal => Push,
+            SuffixKind::Maximal if candidate > current => Accept,
+            SuffixKind::Maximal if candidate < current => Skip,
+            SuffixKind::Maximal => Push,
+        }
+    }
+}
+
+/// A bitset used to track whether a particular byte exists in a needle or not.
+///
+/// Namely, bit 'i' is set if and only if byte%64==i for any byte in the
+/// needle. If a particular byte in the haystack is NOT in this set, then one
+/// can conclude that it is also not in the needle, and thus, one can advance
+/// in the haystack by needle.len() bytes.
+#[derive(Clone, Copy, Debug)]
+struct ApproximateByteSet(u64);
+
+impl ApproximateByteSet {
+    /// Create a new set from the given needle.
+    fn new(needle: &[u8]) -> ApproximateByteSet {
+        let mut bits = 0;
+        for &b in needle {
+            bits |= 1 << (b % 64);
+        }
+        ApproximateByteSet(bits)
+    }
+
+    /// Return true if and only if the given byte might be in this set. This
+    /// may return a false positive, but will never return a false negative.
+    #[inline(always)]
+    fn contains(&self, byte: u8) -> bool {
+        self.0 & (1 << (byte % 64)) != 0
+    }
+}
+
+#[cfg(all(test, feature = "std", not(miri)))]
+mod tests {
+    use quickcheck::quickcheck;
+
+    use super::*;
+
+    define_memmem_quickcheck_tests!(
+        super::simpletests::twoway_find,
+        super::simpletests::twoway_rfind
+    );
+
+    /// Convenience wrapper for computing the suffix as a byte string.
+    fn get_suffix_forward(needle: &[u8], kind: SuffixKind) -> (&[u8], usize) {
+        let s = Suffix::forward(needle, kind);
+        (&needle[s.pos..], s.period)
+    }
+
+    /// Convenience wrapper for computing the reverse suffix as a byte string.
+    fn get_suffix_reverse(needle: &[u8], kind: SuffixKind) -> (&[u8], usize) {
+        let s = Suffix::reverse(needle, kind);
+        (&needle[..s.pos], s.period)
+    }
+
+    /// Return all of the non-empty suffixes in the given byte string.
+    fn suffixes(bytes: &[u8]) -> Vec<&[u8]> {
+        (0..bytes.len()).map(|i| &bytes[i..]).collect()
+    }
+
+    /// Return the lexicographically maximal suffix of the given byte string.
+    fn naive_maximal_suffix_forward(needle: &[u8]) -> &[u8] {
+        let mut sufs = suffixes(needle);
+        sufs.sort();
+        sufs.pop().unwrap()
+    }
+
+    /// Return the lexicographically maximal suffix of the reverse of the given
+    /// byte string.
+    fn naive_maximal_suffix_reverse(needle: &[u8]) -> Vec<u8> {
+        let mut reversed = needle.to_vec();
+        reversed.reverse();
+        let mut got = naive_maximal_suffix_forward(&reversed).to_vec();
+        got.reverse();
+        got
+    }
+
+    #[test]
+    fn suffix_forward() {
+        macro_rules! assert_suffix_min {
+            ($given:expr, $expected:expr, $period:expr) => {
+                let (got_suffix, got_period) =
+                    get_suffix_forward($given.as_bytes(), SuffixKind::Minimal);
+                let got_suffix = std::str::from_utf8(got_suffix).unwrap();
+                assert_eq!(($expected, $period), (got_suffix, got_period));
+            };
+        }
+
+        macro_rules! assert_suffix_max {
+            ($given:expr, $expected:expr, $period:expr) => {
+                let (got_suffix, got_period) =
+                    get_suffix_forward($given.as_bytes(), SuffixKind::Maximal);
+                let got_suffix = std::str::from_utf8(got_suffix).unwrap();
+                assert_eq!(($expected, $period), (got_suffix, got_period));
+            };
+        }
+
+        assert_suffix_min!("a", "a", 1);
+        assert_suffix_max!("a", "a", 1);
+
+        assert_suffix_min!("ab", "ab", 2);
+        assert_suffix_max!("ab", "b", 1);
+
+        assert_suffix_min!("ba", "a", 1);
+        assert_suffix_max!("ba", "ba", 2);
+
+        assert_suffix_min!("abc", "abc", 3);
+        assert_suffix_max!("abc", "c", 1);
+
+        assert_suffix_min!("acb", "acb", 3);
+        assert_suffix_max!("acb", "cb", 2);
+
+        assert_suffix_min!("cba", "a", 1);
+        assert_suffix_max!("cba", "cba", 3);
+
+        assert_suffix_min!("abcabc", "abcabc", 3);
+        assert_suffix_max!("abcabc", "cabc", 3);
+
+        assert_suffix_min!("abcabcabc", "abcabcabc", 3);
+        assert_suffix_max!("abcabcabc", "cabcabc", 3);
+
+        assert_suffix_min!("abczz", "abczz", 5);
+        assert_suffix_max!("abczz", "zz", 1);
+
+        assert_suffix_min!("zzabc", "abc", 3);
+        assert_suffix_max!("zzabc", "zzabc", 5);
+
+        assert_suffix_min!("aaa", "aaa", 1);
+        assert_suffix_max!("aaa", "aaa", 1);
+
+        assert_suffix_min!("foobar", "ar", 2);
+        assert_suffix_max!("foobar", "r", 1);
+    }
+
+    #[test]
+    fn suffix_reverse() {
+        macro_rules! assert_suffix_min {
+            ($given:expr, $expected:expr, $period:expr) => {
+                let (got_suffix, got_period) =
+                    get_suffix_reverse($given.as_bytes(), SuffixKind::Minimal);
+                let got_suffix = std::str::from_utf8(got_suffix).unwrap();
+                assert_eq!(($expected, $period), (got_suffix, got_period));
+            };
+        }
+
+        macro_rules! assert_suffix_max {
+            ($given:expr, $expected:expr, $period:expr) => {
+                let (got_suffix, got_period) =
+                    get_suffix_reverse($given.as_bytes(), SuffixKind::Maximal);
+                let got_suffix = std::str::from_utf8(got_suffix).unwrap();
+                assert_eq!(($expected, $period), (got_suffix, got_period));
+            };
+        }
+
+        assert_suffix_min!("a", "a", 1);
+        assert_suffix_max!("a", "a", 1);
+
+        assert_suffix_min!("ab", "a", 1);
+        assert_suffix_max!("ab", "ab", 2);
+
+        assert_suffix_min!("ba", "ba", 2);
+        assert_suffix_max!("ba", "b", 1);
+
+        assert_suffix_min!("abc", "a", 1);
+        assert_suffix_max!("abc", "abc", 3);
+
+        assert_suffix_min!("acb", "a", 1);
+        assert_suffix_max!("acb", "ac", 2);
+
+        assert_suffix_min!("cba", "cba", 3);
+        assert_suffix_max!("cba", "c", 1);
+
+        assert_suffix_min!("abcabc", "abca", 3);
+        assert_suffix_max!("abcabc", "abcabc", 3);
+
+        assert_suffix_min!("abcabcabc", "abcabca", 3);
+        assert_suffix_max!("abcabcabc", "abcabcabc", 3);
+
+        assert_suffix_min!("abczz", "a", 1);
+        assert_suffix_max!("abczz", "abczz", 5);
+
+        assert_suffix_min!("zzabc", "zza", 3);
+        assert_suffix_max!("zzabc", "zz", 1);
+
+        assert_suffix_min!("aaa", "aaa", 1);
+        assert_suffix_max!("aaa", "aaa", 1);
+    }
+
+    quickcheck! {
+        fn qc_suffix_forward_maximal(bytes: Vec<u8>) -> bool {
+            if bytes.is_empty() {
+                return true;
+            }
+
+            let (got, _) = get_suffix_forward(&bytes, SuffixKind::Maximal);
+            let expected = naive_maximal_suffix_forward(&bytes);
+            got == expected
+        }
+
+        fn qc_suffix_reverse_maximal(bytes: Vec<u8>) -> bool {
+            if bytes.is_empty() {
+                return true;
+            }
+
+            let (got, _) = get_suffix_reverse(&bytes, SuffixKind::Maximal);
+            let expected = naive_maximal_suffix_reverse(&bytes);
+            expected == got
+        }
+    }
+}
+
+#[cfg(test)]
+mod simpletests {
+    use super::*;
+
+    pub(crate) fn twoway_find(
+        haystack: &[u8],
+        needle: &[u8],
+    ) -> Option<usize> {
+        Forward::new(needle).find_general(None, haystack, needle)
+    }
+
+    pub(crate) fn twoway_rfind(
+        haystack: &[u8],
+        needle: &[u8],
+    ) -> Option<usize> {
+        Reverse::new(needle).rfind_general(haystack, needle)
+    }
+
+    define_memmem_simple_tests!(twoway_find, twoway_rfind);
+
+    // This is a regression test caught by quickcheck that exercised a bug in
+    // the reverse small period handling. The bug was that we were using 'if j
+    // == shift' to determine if a match occurred, but the correct guard is 'if
+    // j >= shift', which matches the corresponding guard in the forward impl.
+    #[test]
+    fn regression_rev_small_period() {
+        let rfind = super::simpletests::twoway_rfind;
+        let haystack = "ababaz";
+        let needle = "abab";
+        assert_eq!(Some(0), rfind(haystack.as_bytes(), needle.as_bytes()));
+    }
+}
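The `ApproximateByteSet` in the Two-Way file above folds all 256 byte values into 64 bits, so membership tests can report false positives but never false negatives. The sketch below illustrates that property in isolation. The names `ByteSet64` and `might_contain` are made up for this illustration and are not part of the commit.

```rust
/// Hypothetical stand-in for the approximate byte set: one bit per
/// residue class modulo 64.
#[derive(Clone, Copy, Debug)]
struct ByteSet64(u64);

impl ByteSet64 {
    /// Set bit `b % 64` for every byte `b` in the needle.
    fn new(needle: &[u8]) -> ByteSet64 {
        let mut bits = 0u64;
        for &b in needle {
            bits |= 1u64 << (b % 64);
        }
        ByteSet64(bits)
    }

    /// Returns false only if `byte` is definitely not in the needle.
    /// Bytes that share a residue with a needle byte also return true.
    fn might_contain(&self, byte: u8) -> bool {
        self.0 & (1u64 << (byte % 64)) != 0
    }
}
```

For the needle `abc`, the byte `!` (33) collides with `a` (97, and 97 % 64 == 33), so it is reported as possibly present, while `z` (122 % 64 == 58) is ruled out.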
@@ -0,0 +1,88 @@
+// These routines are meant to be optimized specifically for low latency as
+// compared to the equivalent routines offered by std. (Which may invoke the
+// dynamic linker and call out to libc, which introduces a bit more latency
+// than we'd like.)
+
+/// Returns true if and only if needle is a prefix of haystack.
+#[inline(always)]
+pub(crate) fn is_prefix(haystack: &[u8], needle: &[u8]) -> bool {
+    needle.len() <= haystack.len() && memcmp(&haystack[..needle.len()], needle)
+}
+
+/// Returns true if and only if needle is a suffix of haystack.
+#[inline(always)]
+pub(crate) fn is_suffix(haystack: &[u8], needle: &[u8]) -> bool {
+    needle.len() <= haystack.len()
+        && memcmp(&haystack[haystack.len() - needle.len()..], needle)
+}
+
+/// Return true if and only if x.len() == y.len() && x[i] == y[i] for all
+/// 0 <= i < x.len().
+///
+/// Why not just use actual memcmp for this? Well, memcmp requires calling out
+/// to libc, and this routine is called in fairly hot code paths. Other than
+/// just calling out to libc, it also seems to result in worse codegen. By
+/// rolling our own memcmp in pure Rust, we seem to get code that is more
+/// friendly to the optimizer.
+///
+/// We mark this as inline always, although some callers may not want it
+/// inlined for better codegen (like Rabin-Karp). In that case, callers are
+/// advised to create a non-inlineable wrapper routine that calls memcmp.
+#[inline(always)]
+pub(crate) fn memcmp(x: &[u8], y: &[u8]) -> bool {
+    if x.len() != y.len() {
+        return false;
+    }
+    // If we don't have enough bytes to do 4-byte at a time loads, then
+    // fall back to the naive slow version.
+    //
+    // TODO: We could do a copy_nonoverlapping combined with a mask instead
+    // of a loop. Benchmark it.
+    if x.len() < 4 {
+        for (&b1, &b2) in x.iter().zip(y) {
+            if b1 != b2 {
+                return false;
+            }
+        }
+        return true;
+    }
+    // When we have 4 or more bytes to compare, then proceed in chunks of 4 at
+    // a time using unaligned loads.
+    //
+    // Also, why do 4 byte loads instead of, say, 8 byte loads? The reason is
+    // that this particular version of memcmp is likely to be called with tiny
+    // needles. That means that if we do 8 byte loads, then a higher proportion
+    // of memcmp calls will use the slower variant above. With that said, this
+    // is a hypothesis and is only loosely supported by benchmarks. There's
+    // likely some improvement that could be made here. The main thing here
+    // though is to optimize for latency, not throughput.
+
+    // SAFETY: Via the conditional above, we know that both `px` and `py`
+    // have the same length, so `px < pxend` implies that `py < pyend`.
+    // Thus, dereferencing both `px` and `py` in the loop below is safe.
+    //
+    // Moreover, we set `pxend` and `pyend` to be 4 bytes before the actual
+    // end of `px` and `py`. Thus, the final dereference outside of the
+    // loop is guaranteed to be valid. (The final comparison will overlap with
+    // the last comparison done in the loop for lengths that aren't multiples
+    // of four.)
+    //
+    // Finally, we needn't worry about alignment here, since we do unaligned
+    // loads.
+    unsafe {
+        let (mut px, mut py) = (x.as_ptr(), y.as_ptr());
+        let (pxend, pyend) = (px.add(x.len() - 4), py.add(y.len() - 4));
+        while px < pxend {
+            let vx = (px as *const u32).read_unaligned();
+            let vy = (py as *const u32).read_unaligned();
+            if vx != vy {
+                return false;
+            }
+            px = px.add(4);
+            py = py.add(4);
+        }
+        let vx = (pxend as *const u32).read_unaligned();
+        let vy = (pyend as *const u32).read_unaligned();
+        vx == vy
+    }
+}
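The overlapping final load in `memcmp` above is easier to see without raw pointers. The following is a safe sketch of the same control flow, assuming index-based 4-byte loads in place of `read_unaligned`. The name `eq_chunked` is hypothetical, and this version is meant to show the technique rather than match the performance of the original.

```rust
/// Compare two slices four bytes at a time, finishing with one load that
/// is anchored at `len - 4` and may overlap the previous chunk.
fn eq_chunked(x: &[u8], y: &[u8]) -> bool {
    if x.len() != y.len() {
        return false;
    }
    // Too short for a 4-byte load: compare byte by byte.
    if x.len() < 4 {
        return x.iter().zip(y).all(|(a, b)| a == b);
    }
    // Assemble a u32 from four consecutive bytes starting at `i`.
    let load = |s: &[u8], i: usize| {
        u32::from_ne_bytes([s[i], s[i + 1], s[i + 2], s[i + 3]])
    };
    let last = x.len() - 4;
    let mut i = 0;
    while i < last {
        if load(x, i) != load(y, i) {
            return false;
        }
        i += 4;
    }
    // The final load covers the tail. For lengths that are not a multiple
    // of four it re-reads some bytes the loop already compared.
    load(x, last) == load(y, last)
}
```

For a length of 6, the loop compares bytes 0..4 and the final load compares bytes 2..6, so bytes 2 and 3 are checked twice and no separate tail loop is needed.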
@@ -0,0 +1,98 @@
+/// A trait for describing vector operations used by vectorized searchers.
+///
+/// The trait is deliberately limited to the low level vector operations we
+/// need. In general, it was invented mostly to be generic over x86's __m128i
+/// and __m256i types. It's likely that once std::simd becomes a thing, we can
+/// migrate to that since the operations required are quite simple.
+///
+/// TODO: Consider moving this trait up a level and using it to implement
+/// memchr as well. The trait might need to grow one or two methods, but
+/// otherwise should be close to sufficient already.
+///
+/// # Safety
+///
+/// All methods are unsafe since they are intended to be implemented using
+/// vendor intrinsics, which are also unsafe. Callers must ensure that the
+/// appropriate target features are enabled in the calling function, and that
+/// the current CPU supports them. All implementations should avoid marking the
+/// routines with #[target_feature] and instead mark them as #[inline(always)]
+/// to ensure they get appropriately inlined. (inline(always) cannot be used
+/// with target_feature.)
+pub(crate) trait Vector: Copy + core::fmt::Debug {
+    /// _mm_set1_epi8 or _mm256_set1_epi8
+    unsafe fn splat(byte: u8) -> Self;
+    /// _mm_loadu_si128 or _mm256_loadu_si256
+    unsafe fn load_unaligned(data: *const u8) -> Self;
+    /// _mm_movemask_epi8 or _mm256_movemask_epi8
+    unsafe fn movemask(self) -> u32;
+    /// _mm_cmpeq_epi8 or _mm256_cmpeq_epi8
+    unsafe fn cmpeq(self, vector2: Self) -> Self;
+    /// _mm_and_si128 or _mm256_and_si256
+    unsafe fn and(self, vector2: Self) -> Self;
+}
+
+#[cfg(target_arch = "x86_64")]
+mod x86sse {
+    use super::Vector;
+    use core::arch::x86_64::*;
+
+    impl Vector for __m128i {
+        #[inline(always)]
+        unsafe fn splat(byte: u8) -> __m128i {
+            _mm_set1_epi8(byte as i8)
+        }
+
+        #[inline(always)]
+        unsafe fn load_unaligned(data: *const u8) -> __m128i {
+            _mm_loadu_si128(data as *const __m128i)
+        }
+
+        #[inline(always)]
+        unsafe fn movemask(self) -> u32 {
+            _mm_movemask_epi8(self) as u32
+        }
+
+        #[inline(always)]
+        unsafe fn cmpeq(self, vector2: Self) -> __m128i {
+            _mm_cmpeq_epi8(self, vector2)
+        }
+
+        #[inline(always)]
+        unsafe fn and(self, vector2: Self) -> __m128i {
+            _mm_and_si128(self, vector2)
+        }
+    }
+}
+
+#[cfg(all(feature = "std", target_arch = "x86_64"))]
+mod x86avx {
+    use super::Vector;
+    use core::arch::x86_64::*;
+
+    impl Vector for __m256i {
+        #[inline(always)]
+        unsafe fn splat(byte: u8) -> __m256i {
+            _mm256_set1_epi8(byte as i8)
+        }
+
+        #[inline(always)]
+        unsafe fn load_unaligned(data: *const u8) -> __m256i {
+            _mm256_loadu_si256(data as *const __m256i)
+        }
+
+        #[inline(always)]
+        unsafe fn movemask(self) -> u32 {
+            _mm256_movemask_epi8(self) as u32
+        }
+
+        #[inline(always)]
+        unsafe fn cmpeq(self, vector2: Self) -> __m256i {
+            _mm256_cmpeq_epi8(self, vector2)
+        }
+
+        #[inline(always)]
+        unsafe fn and(self, vector2: Self) -> __m256i {
+            _mm256_and_si256(self, vector2)
+        }
+    }
+}
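To make the five `Vector` operations above concrete, here is a portable scalar emulation of a 16-lane byte vector, together with one way the operations can combine to find candidate match positions. This is a sketch under assumptions: `Lanes16` and `candidate_mask` are hypothetical names, the real implementations use the x86 intrinsics named in the trait docs, and the candidate step only loosely follows the "generic SIMD" idea referenced in the commit message.

```rust
/// Scalar stand-in for a 128-bit vector of sixteen u8 lanes.
#[derive(Clone, Copy, Debug)]
struct Lanes16([u8; 16]);

impl Lanes16 {
    /// Broadcast one byte to every lane (like _mm_set1_epi8).
    fn splat(byte: u8) -> Self {
        Lanes16([byte; 16])
    }

    /// Load the first 16 bytes of `data`. Panics if `data` is shorter.
    fn load(data: &[u8]) -> Self {
        let mut lanes = [0u8; 16];
        lanes.copy_from_slice(&data[..16]);
        Lanes16(lanes)
    }

    /// Lane-wise equality: 0xFF where equal, 0x00 otherwise.
    fn cmpeq(self, other: Self) -> Self {
        let mut out = [0u8; 16];
        for i in 0..16 {
            out[i] = if self.0[i] == other.0[i] { 0xFF } else { 0x00 };
        }
        Lanes16(out)
    }

    /// Lane-wise bitwise AND.
    fn and(self, other: Self) -> Self {
        let mut out = [0u8; 16];
        for i in 0..16 {
            out[i] = self.0[i] & other.0[i];
        }
        Lanes16(out)
    }

    /// Collect the high bit of each lane into bit `i` of the result.
    fn movemask(self) -> u32 {
        let mut mask = 0u32;
        for i in 0..16 {
            if self.0[i] & 0x80 != 0 {
                mask |= 1 << i;
            }
        }
        mask
    }
}

/// Bit `i` is set when `haystack[i] == first` and
/// `haystack[i + last_offset] == last`, for `i` in 0..16.
/// Requires `haystack.len() >= last_offset + 16`.
fn candidate_mask(haystack: &[u8], first: u8, last: u8, last_offset: usize) -> u32 {
    let eq_first = Lanes16::load(haystack).cmpeq(Lanes16::splat(first));
    let eq_last =
        Lanes16::load(&haystack[last_offset..]).cmpeq(Lanes16::splat(last));
    eq_first.and(eq_last).movemask()
}
```

Each set bit marks a position that still has to be verified with a full comparison. Calling `trailing_zeros()` on the mask gives the first candidate.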
@@ -0,0 +1,139 @@
+#[cfg(not(feature = "std"))]
+pub(crate) use self::nostd::Forward;
+#[cfg(feature = "std")]
+pub(crate) use self::std::Forward;
+
+#[cfg(feature = "std")]
+mod std {
+    use core::arch::x86_64::{__m128i, __m256i};
+
+    use crate::memmem::{genericsimd, NeedleInfo};
+
+    /// An AVX accelerated vectorized substring search routine that only works
+    /// on small needles.
+    #[derive(Clone, Copy, Debug)]
+    pub(crate) struct Forward(genericsimd::Forward);
+
+    impl Forward {
+        /// Create a new "generic simd" forward searcher. If one could not be
+        /// created from the given inputs, then None is returned.
+        pub(crate) fn new(
+            ninfo: &NeedleInfo,
+            needle: &[u8],
+        ) -> Option<Forward> {
+            if !cfg!(memchr_runtime_avx) || !is_x86_feature_detected!("avx2") {
+                return None;
+            }
+            genericsimd::Forward::new(ninfo, needle).map(Forward)
+        }
+
+        /// Returns the minimum length of haystack that is needed for this
+        /// searcher to work. Passing a haystack with a length smaller than
+        /// this will cause `find` to panic.
+        #[inline(always)]
+        pub(crate) fn min_haystack_len(&self) -> usize {
+            self.0.min_haystack_len::<__m128i>()
+        }
+
+        #[inline(always)]
+        pub(crate) fn find(
+            &self,
+            haystack: &[u8],
+            needle: &[u8],
+        ) -> Option<usize> {
+            // SAFETY: The only way a Forward value can exist is if the avx2
+            // target feature is enabled. This is the only safety requirement
+            // for calling the genericsimd searcher.
+            unsafe { self.find_impl(haystack, needle) }
+        }
+
+        /// The implementation of find marked with the appropriate target
+        /// feature.
+        ///
+        /// # Safety
+        ///
+        /// Callers must ensure that the avx2 CPU feature is enabled in the
+        /// current environment.
+        #[target_feature(enable = "avx2")]
+        unsafe fn find_impl(
+            &self,
+            haystack: &[u8],
+            needle: &[u8],
+        ) -> Option<usize> {
+            if haystack.len() < self.0.min_haystack_len::<__m256i>() {
+                genericsimd::fwd_find::<__m128i>(&self.0, haystack, needle)
+            } else {
+                genericsimd::fwd_find::<__m256i>(&self.0, haystack, needle)
+            }
+        }
+    }
+}
+
+// We still define the avx "forward" type on nostd to make caller code a bit
+// simpler. This avoids needing a lot more conditional compilation.
+#[cfg(not(feature = "std"))]
+mod nostd {
+    use crate::memmem::NeedleInfo;
+
+    #[derive(Clone, Copy, Debug)]
+    pub(crate) struct Forward(());
+
+    impl Forward {
+        pub(crate) fn new(
+            ninfo: &NeedleInfo,
+            needle: &[u8],
+        ) -> Option<Forward> {
+            None
+        }
+
+        pub(crate) fn min_haystack_len(&self) -> usize {
+            unreachable!()
+        }
+
+        pub(crate) fn find(
+            &self,
+            haystack: &[u8],
+            needle: &[u8],
+        ) -> Option<usize> {
+            unreachable!()
+        }
+    }
+}
+
+#[cfg(all(test, feature = "std", not(miri)))]
+mod tests {
+    use crate::memmem::{prefilter::PrefilterState, NeedleInfo};
+
+    fn find(
+        _: &mut PrefilterState,
+        ninfo: &NeedleInfo,
+        haystack: &[u8],
+        needle: &[u8],
+    ) -> Option<usize> {
+        super::Forward::new(ninfo, needle).unwrap().find(haystack, needle)
+    }
+
+    #[test]
+    fn prefilter_permutations() {
+        use crate::memmem::prefilter::tests::PrefilterTest;
+
+        if !is_x86_feature_detected!("avx2") {
+            return;
+        }
+        // SAFETY: The safety of find only requires that the current CPU
+        // support AVX2, which we checked above.
+        unsafe {
+            PrefilterTest::run_all_tests_filter(find, |t| {
+                // This substring searcher only works on certain configs, so
+                // filter our tests such that Forward::new will be guaranteed
+                // to succeed. (And also remove tests with a haystack that is
+                // too small.)
+                let fwd = match super::Forward::new(&t.ninfo, &t.needle) {
+                    None => return false,
+                    Some(fwd) => fwd,
+                };
+                t.haystack.len() >= fwd.min_haystack_len()
+            })
+        }
+    }
+}
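The AVX `Forward` type above pairs a runtime CPU check with a `#[target_feature]` function behind a safe wrapper. A minimal sketch of that pattern follows, assuming a trivial byte search in place of the real searcher. All function names here are hypothetical.

```rust
/// Portable fallback used when AVX2 is unavailable.
fn find_byte_fallback(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Same logic, compiled with AVX2 enabled so the compiler may use it.
///
/// # Safety
///
/// Callers must ensure the current CPU supports AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn find_byte_avx2(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Safe entry point: check the CPU at runtime, then dispatch.
pub fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: we just verified that AVX2 is available.
            return unsafe { find_byte_avx2(needle, haystack) };
        }
    }
    find_byte_fallback(needle, haystack)
}
```

The commit performs the check once in `Forward::new` and relies on the existence of a `Forward` value as proof that AVX2 is present, which keeps the check out of the search path. The sketch checks on every call for simplicity.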
@@ -0,0 +1,2 @@
+pub(crate) mod avx;
+pub(crate) mod sse;
@@ -0,0 +1,89 @@
+use core::arch::x86_64::__m128i;
+
+use crate::memmem::{genericsimd, NeedleInfo};
+
+/// An SSE accelerated vectorized substring search routine that only works on
+/// small needles.
+#[derive(Clone, Copy, Debug)]
+pub(crate) struct Forward(genericsimd::Forward);
+
+impl Forward {
+    /// Create a new "generic simd" forward searcher. If one could not be
+    /// created from the given inputs, then None is returned.
+    pub(crate) fn new(ninfo: &NeedleInfo, needle: &[u8]) -> Option<Forward> {
+        if !cfg!(memchr_runtime_sse2) {
+            return None;
+        }
+        genericsimd::Forward::new(ninfo, needle).map(Forward)
+    }
+
+    /// Returns the minimum length of haystack that is needed for this searcher
+    /// to work. Passing a haystack with a length smaller than this will cause
+    /// `find` to panic.
+    #[inline(always)]
+    pub(crate) fn min_haystack_len(&self) -> usize {
+        self.0.min_haystack_len::<__m128i>()
+    }
+
+    #[inline(always)]
+    pub(crate) fn find(
+        &self,
+        haystack: &[u8],
+        needle: &[u8],
+    ) -> Option<usize> {
+        // SAFETY: sse2 is enabled on all x86_64 targets, so this is always
+        // safe to call.
+        unsafe { self.find_impl(haystack, needle) }
+    }
+
+    /// The implementation of find marked with the appropriate target feature.
+    ///
+    /// # Safety
+    ///
+    /// This is safe to call in all cases since sse2 is guaranteed to be part
+    /// of x86_64. It is marked as unsafe because of the target feature
+    /// attribute.
+    #[target_feature(enable = "sse2")]
+    unsafe fn find_impl(
+        &self,
+        haystack: &[u8],
+        needle: &[u8],
+    ) -> Option<usize> {
+        genericsimd::fwd_find::<__m128i>(&self.0, haystack, needle)
+    }
+}
+
+#[cfg(all(test, feature = "std", not(miri)))]
+mod tests {
+    use crate::memmem::{prefilter::PrefilterState, NeedleInfo};
+
+    fn find(
+        _: &mut PrefilterState,
+        ninfo: &NeedleInfo,
+        haystack: &[u8],
+        needle: &[u8],
+    ) -> Option<usize> {
+        super::Forward::new(ninfo, needle).unwrap().find(haystack, needle)
+    }
+
+    #[test]
+    fn prefilter_permutations() {
+        use crate::memmem::prefilter::tests::PrefilterTest;
+
+        // SAFETY: sse2 is enabled on all x86_64 targets, so this is always
+        // safe to call.
+        unsafe {
+            PrefilterTest::run_all_tests_filter(find, |t| {
+                // This substring searcher only works on certain configs, so
+                // filter our tests such that Forward::new will be guaranteed
+                // to succeed. (And also remove tests with a haystack that is
+                // too small.)
+                let fwd = match super::Forward::new(&t.ninfo, &t.needle) {
+                    None => return false,
+                    Some(fwd) => fwd,
+                };
+                t.haystack.len() >= fwd.min_haystack_len()
+            })
+        }
+    }
+}
@@ -1,7 +1,9 @@
 use quickcheck::quickcheck;

 use crate::{
-    fallback, memchr, memchr2, memchr3, memrchr, memrchr2, memrchr3, naive,
+    memchr,
+    memchr::{fallback, naive},
+    memchr2, memchr3, memrchr, memrchr2, memrchr3,
    tests::memchr::testdata::memchr_tests,
 };

@@ -1,119 +0,0 @@
-use crate::fallback;
-
-// We only use AVX when we can detect at runtime whether it's available, which
-// requires std.
-#[cfg(feature = "std")]
-mod avx;
-mod sse2;
-
-// This macro employs a gcc-like "ifunc" trick where by upon first calling
-// `memchr` (for example), CPU feature detection will be performed at runtime
-// to determine the best implementation to use. After CPU feature detection
-// is done, we replace `memchr`'s function pointer with the selection. Upon
-// subsequent invocations, the CPU-specific routine is invoked directly, which
-// skips the CPU feature detection and subsequent branch that's required.
-//
-// While this typically doesn't matter for rare occurrences or when used on
-// larger haystacks, `memchr` can be called in tight loops where the overhead
-// of this branch can actually add up *and is measurable*. This trick was
-// necessary to bring this implementation up to glibc's speeds for the 'tiny'
-// benchmarks, for example.
-//
-// At some point, I expect the Rust ecosystem will get a nice macro for doing
-// exactly this, at which point, we can replace our hand-jammed version of it.
-//
-// N.B. The ifunc strategy does prevent function inlining of course, but on
-// modern CPUs, you'll probably end up with the AVX2 implementation, which
-// probably can't be inlined anyway---unless you've compiled your entire
-// program with AVX2 enabled. However, even then, the various memchr
-// implementations aren't exactly small, so inlining might not help anyway!
-#[cfg(feature = "std")]
-macro_rules! ifunc {
-    ($fnty:ty, $name:ident, $haystack:ident, $($needle:ident),+) => {{
-        use std::mem;
-        use std::sync::atomic::{AtomicPtr, Ordering};
-
-        type FnRaw = *mut ();
-
-        static FN: AtomicPtr<()> = AtomicPtr::new(detect as FnRaw);
-
-        fn detect($($needle: u8),+, haystack: &[u8]) -> Option<usize> {
-            let fun =
-                if cfg!(memchr_runtime_avx) && is_x86_feature_detected!("avx2") {
-                    avx::$name as FnRaw
-                } else if cfg!(memchr_runtime_sse2) {
-                    sse2::$name as FnRaw
-                } else {
-                    fallback::$name as FnRaw
-                };
-            FN.store(fun as FnRaw, Ordering::Relaxed);
-            unsafe {
-                mem::transmute::<FnRaw, $fnty>(fun)($($needle),+, haystack)
-            }
-        }
-
-        unsafe {
-            let fun = FN.load(Ordering::Relaxed);
-            mem::transmute::<FnRaw, $fnty>(fun)($($needle),+, $haystack)
-        }
-    }}
-}
-
-// When std isn't available to provide runtime CPU feature detection, or if
-// runtime CPU feature detection has been explicitly disabled, then just call
-// our optimized SSE2 routine directly. SSE2 is available on all x86_64 targets,
-// so no CPU feature detection is necessary.
-#[cfg(not(feature = "std"))]
-macro_rules! ifunc {
-    ($fnty:ty, $name:ident, $haystack:ident, $($needle:ident),+) => {{
-        if cfg!(memchr_runtime_sse2) {
-            unsafe { sse2::$name($($needle),+, $haystack) }
-        } else {
-            fallback::$name($($needle),+, $haystack)
-        }
-    }}
-}
-
-#[inline(always)]
-pub fn memchr(n1: u8, haystack: &[u8]) -> Option<usize> {
-    ifunc!(fn(u8, &[u8]) -> Option<usize>, memchr, haystack, n1)
-}
-
-#[inline(always)]
-pub fn memchr2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
-    ifunc!(fn(u8, u8, &[u8]) -> Option<usize>, memchr2, haystack, n1, n2)
-}
-
-#[inline(always)]
-pub fn memchr3(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
-    ifunc!(
-        fn(u8, u8, u8, &[u8]) -> Option<usize>,
-        memchr3,
-        haystack,
-        n1,
-        n2,
-        n3
-    )
-}
-
-#[inline(always)]
-pub fn memrchr(n1: u8, haystack: &[u8]) -> Option<usize> {
-    ifunc!(fn(u8, &[u8]) -> Option<usize>, memrchr, haystack, n1)
-}
-
-#[inline(always)]
-pub fn memrchr2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
-    ifunc!(fn(u8, u8, &[u8]) -> Option<usize>, memrchr2, haystack, n1, n2)
-}
-
-#[inline(always)]
-pub fn memrchr3(n1: u8, n2: u8, n3: u8, haystack: &[u8]) -> Option<usize> {
-    ifunc!(
-        fn(u8, u8, u8, &[u8]) -> Option<usize>,
-        memrchr3,
-        haystack,
-        n1,
-        n2,
-        n3
-    )
-}
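The `ifunc!` comment in the removed file describes replacing a function pointer after the first call so that later calls skip detection. The sketch below expands that idea for a single signature without the macro. The names `FindFn`, `find_scalar` and `find` are illustrative, and the detection step is a placeholder that always picks the scalar routine.

```rust
use std::mem;
use std::sync::atomic::{AtomicPtr, Ordering};

type FindFn = fn(u8, &[u8]) -> Option<usize>;

/// Starts out pointing at `detect`; overwritten on the first call.
static FN: AtomicPtr<()> = AtomicPtr::new(detect as *mut ());

/// Placeholder implementation that detection selects in this sketch.
fn find_scalar(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Runs once (or a few times under a race, which is harmless here):
/// picks an implementation, stores it, then forwards the call.
fn detect(needle: u8, haystack: &[u8]) -> Option<usize> {
    // A real version would inspect CPU features at this point.
    let chosen: FindFn = find_scalar;
    FN.store(chosen as *mut (), Ordering::Relaxed);
    chosen(needle, haystack)
}

/// Public entry point: load the current pointer and call through it.
pub fn find(needle: u8, haystack: &[u8]) -> Option<usize> {
    let raw = FN.load(Ordering::Relaxed);
    // SAFETY: FN only ever holds pointers made from functions whose
    // type is exactly `FindFn`.
    let f = unsafe { mem::transmute::<*mut (), FindFn>(raw) };
    f(needle, haystack)
}
```

After the first call, `find` costs one relaxed atomic load and an indirect call, with no feature check on the path.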
				`@@ -0,0 +1 @@`
				zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz