Andrew Gallant 31a317eadd Major literal optimization refactoring.
The principal change in this commit is a complete rewrite of how
literals are detected from a regular expression. In particular, we now
traverse the abstract syntax to discover literals instead of the
compiled byte code. This permits more tunable control over which and
how many literals are extracted, and is now exposed in the
`regex-syntax` crate so that others can benefit from it.
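
(As a rough illustration of what traversing the abstract syntax means here,
the toy sketch below walks a hand-rolled AST of literal, alternation and
concatenation nodes to collect required prefix literals. It is not the
`regex-syntax` API; the node encoding and function name are invented for
the example.)

    def prefix_literals(ast):
        """Collect the literal prefixes every match must start with, or
        None if no useful prefix exists (class, repetition, anchor, ...)."""
        kind = ast[0]
        if kind == 'lit':
            return {ast[1]}
        if kind == 'alt':
            parts = [prefix_literals(a) for a in ast[1]]
            if any(p is None for p in parts):
                return None
            return set().union(*parts)
        if kind == 'concat':
            # In this simplified model, only the first element of a
            # concatenation contributes to the prefix.
            return prefix_literals(ast[1][0]) if ast[1] else None
        return None

    # (foo|bar)baz  ->  {'foo', 'bar'}
    ast = ('concat', [('alt', [('lit', 'foo'), ('lit', 'bar')]), ('lit', 'baz')])
    print(prefix_literals(ast))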

Other changes in this commit:

* The Boyer-Moore algorithm was rewritten to use my own concoction based
  on frequency analysis. We end up regressing slightly on a couple of
  benchmarks because of this, but gain on some others and should in
  general be faster across a broader range of inputs. (Principally
  because we try to run `memchr` on the rarest byte in a literal; see
  the sketch after this list.) This should also greatly improve handling
  of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix
  literals exist but no prefix literals exist, then we can quickly scan
  for suffix matches and then run the DFA in reverse to find matches.
  (I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool
  (from the new `mempool` crate). This reduces some amount of constant
  overhead and improves several benchmarks that either search short
  haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it
  could only contain 2 or more). The InvalidSet error variant is now
  deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the
  start state was always the first instruction, which is trivially
  wrong for an expression like `^☃$`. This bug persisted because it
  typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published
  sub-crate. The CLI tool can report various facts about a regular
  expression, such as its AST, its compiled byte code or its detected
  literals.
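
As a rough sketch of the rarest-byte idea from the first bullet above (not
the crate's actual code), assume a 256-entry rank table like the one
generated by scripts/frequencies.py below, where a higher value means a
more common byte:

    def rarest_byte_offset(literal, byte_frequencies):
        """Pick the offset of the least common byte in a literal. A
        memchr-style scan can then look for that byte and only verify
        the full literal around each hit."""
        return min(range(len(literal)),
                   key=lambda i: byte_frequencies[literal[i]])

    # Hypothetical toy table: a handful of common ASCII letters rank high,
    # everything else ranks 0 (rare).
    freqs = [0] * 256
    for b in b'etaoinshrdlu ':
        freqs[b] = 200

    needle = b'sherlock'
    i = rarest_byte_offset(needle, freqs)
    print(i, chr(needle[i]))  # 6 c -- 'c' is rarer than the letters around it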
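
Similarly, a simplified sketch of the reverse suffix idea from the second
bullet (a real engine runs a reverse DFA and resolves overlapping
candidates with proper leftmost-match semantics; here a hand-reversed
pattern and Python's re module stand in for it):

    import re

    def reverse_suffix_find(haystack, suffix, rev_pattern):
        """Locate candidate match ends via the suffix literal, then confirm
        each candidate by matching a reversed pattern against the reversed
        prefix of the haystack. Candidates may overlap in this toy version."""
        spans = []
        end = haystack.find(suffix)
        while end != -1:
            stop = end + len(suffix)
            m = re.match(rev_pattern, haystack[:stop][::-1])
            if m:
                spans.append((stop - m.end(), stop))
            end = haystack.find(suffix, end + 1)
        return spans

    # For the pattern `[a-z]+ing`, the hand-reversed pattern is `gni[a-z]+`.
    print(reverse_suffix_find("singing and running", "ing", r"gni[a-z]+"))
    # [(0, 4), (0, 7), (12, 19)]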

Closes #96, #188, #189
2016-03-27 20:07:46 -04:00

scripts/frequencies.py (83 lines, Python, executable file)

#!/usr/bin/env python
# This does simple normalized frequency analysis on UTF-8 encoded text. The
# result of the analysis is translated to a ranked list, where every byte is
# assigned a rank. This list is written to src/freqs.rs.
#
# Currently, the frequencies are generated from the following corpora:
#
# * The CIA World Factbook
# * The source code of rustc
# * Septuaginta
from __future__ import absolute_import, division, print_function
import argparse
from collections import Counter
import sys
preamble = '''// Copyright 2012-2015 The Rust Project Developers. See the COPYRIGHT
// file at the top-level directory of this distribution and at
// http://rust-lang.org/COPYRIGHT.
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.
// NOTE: The following code was generated by "scripts/frequencies.py", do not
// edit directly
'''
def eprint(*args, **kwargs):
    kwargs['file'] = sys.stderr
    print(*args, **kwargs)


def main():
    p = argparse.ArgumentParser()
    p.add_argument('corpus', metavar='FILE', nargs='+')
    args = p.parse_args()

    # Get frequency counts of each byte.
    freqs = Counter()
    for i in range(0, 256):
        freqs[i] = 0

    eprint('reading entire corpus into memory')
    corpus = []
    for fpath in args.corpus:
        corpus.append(open(fpath, 'rb').read())

    eprint('computing byte frequencies')
    for c in corpus:
        for byte in c:
            # Normalize by corpus length so that each corpus contributes
            # equally to the totals, regardless of its size.
            freqs[byte] += 1.0 / float(len(c))

    eprint('writing Rust code')
    # Get the rank of each byte. A lower rank => lower relative frequency.
    rank = [0] * 256
    for i, (byte, _) in enumerate(freqs.most_common()):
        rank[byte] = 255 - i
    # Forcefully set the highest rank possible for bytes that start multi-byte
    # UTF-8 sequences. The idea here is that a continuation byte will be more
    # discerning in a homogeneous haystack.
    for byte in range(0xC0, 0xFF + 1):
        rank[byte] = 255

    # Now write Rust.
    olines = ['pub const BYTE_FREQUENCIES: [u8; 256] = [']
    for byte in range(256):
        olines.append(' %3d, // %r' % (rank[byte], chr(byte)))
    olines.append('];')
    print(preamble)
    print('\n'.join(olines))


if __name__ == '__main__':
    main()