Mirror of https://gitee.com/openharmony/third_party_rust_regex, synced 2025-04-08 05:01:36 +00:00

The principal change in this commit is a complete rewrite of how literals are detected from a regular expression. In particular, we now traverse the abstract syntax to discover literals instead of the compiled byte code. This permits more tunable control over which and how many literals are extracted, and is now exposed in the `regex-syntax` crate so that others can benefit from it.

Other changes in this commit:

* The Boyer-Moore algorithm was rewritten to use my own concoction based on frequency analysis. We end up regressing slightly on a couple of benchmarks because of this, but gain on others and in general should be faster in a broader number of cases. (Principally because we try to run `memchr` on the rarest byte in a literal; see the sketch below.) This should also greatly improve handling of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix literals exist but no prefix literals exist, then we can quickly scan for suffix matches and then run the DFA in reverse to find matches. (I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool (from the new `mempool` crate). This reduces some constant overhead and improves several benchmarks that either search short haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it could only contain 2 or more). The InvalidSet error variant is now deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the start state was always the first instruction, which is trivially wrong for an expression like `^☃$`. This bug persisted because it typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published sub-crate. The CLI tool can answer various questions about a regular expression, such as printing its AST, its compiled byte code, or its detected literals.

Closes #96, #188, #189
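The rarest-byte scan mentioned in the first bullet is straightforward to sketch. What follows is a minimal Python illustration, not the crate's actual code: `byte_rank` stands in for the generated BYTE_FREQUENCIES table produced by the script below, and `bytes.find` stands in for `memchr`. The idea is to scan for the rarest byte of the literal and only then verify the whole literal around each hit.

    def find_literal(haystack, needle, byte_rank):
        # Offset of the rarest needle byte, per the frequency ranking
        # (a lower rank means a rarer byte). haystack/needle are bytes.
        offset = min(range(len(needle)), key=lambda i: byte_rank[needle[i]])
        rare = needle[offset]
        i = haystack.find(rare, offset)  # stand-in for memchr
        while i != -1:
            start = i - offset
            if haystack[start:start + len(needle)] == needle:
                return start
            i = haystack.find(rare, i + 1)
        return -1

Scanning for the rarest byte means far fewer candidate positions survive to the verification step than when scanning for the first byte, which is often a very common one such as a space or an ASCII letter.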
83 lines · 2.4 KiB · Python · Executable File
#!/usr/bin/env python

# This does simple normalized frequency analysis on UTF-8 encoded text. The
# result of the analysis is translated to a ranked list, where every byte is
# assigned a rank. This list is written to src/freqs.rs.
#
# Currently, the frequencies are generated from the following corpora:
#
# * The CIA world fact book
# * The source code of rustc
# * Septuaginta

from __future__ import absolute_import, division, print_function

import argparse
from collections import Counter
import sys

preamble = '''// Copyright 2012-2015 The Rust Project Developers. See the COPYRIGHT
// file at the top-level directory of this distribution and at
// http://rust-lang.org/COPYRIGHT.
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.

// NOTE: The following code was generated by "scripts/frequencies.py", do not
// edit directly
'''


def eprint(*args, **kwargs):
    kwargs['file'] = sys.stderr
    print(*args, **kwargs)


def main():
    p = argparse.ArgumentParser()
    p.add_argument('corpus', metavar='FILE', nargs='+')
    args = p.parse_args()

    # Get frequency counts of each byte. Pre-seed every byte value so that
    # even bytes absent from the corpus receive a rank below.
    freqs = Counter()
    for i in range(0, 256):
        freqs[i] = 0

    eprint('reading entire corpus into memory')
    corpus = []
    for fpath in args.corpus:
        corpus.append(open(fpath, 'rb').read())

    eprint('computing byte frequencies')
    for c in corpus:
        # Iterating a bytearray yields ints on both Python 2 and 3.
        for byte in bytearray(c):
            # Normalize by file length so each corpus file contributes
            # equally to the overall frequencies, regardless of its size.
            freqs[byte] += 1.0 / float(len(c))

    eprint('writing Rust code')
    # Get the rank of each byte. A lower rank corresponds to a lower
    # relative frequency; the most common byte gets rank 255.
    rank = [0] * 256
    for i, (byte, _) in enumerate(freqs.most_common()):
        rank[byte] = 255 - i

    # Forcefully set the highest rank possible for bytes that start
    # multi-byte UTF-8 sequences. The idea here is that a continuation byte
    # will be more discerning in a homogeneous haystack.
    for byte in range(0xC0, 0xFF + 1):
        rank[byte] = 255

    # Now write Rust.
    olines = ['pub const BYTE_FREQUENCIES: [u8; 256] = [']
    for byte in range(256):
        olines.append('    %3d, // %r' % (rank[byte], chr(byte)))
    olines.append('];')

    print(preamble)
    print('\n'.join(olines))

if __name__ == '__main__':
    main()
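The script prints the generated Rust to stdout, so regenerating src/freqs.rs is presumably a matter of redirecting its output. The corpus file names below are illustrative, not paths that exist in the repository:

    $ python scripts/frequencies.py factbook.txt rustc-src.txt septuaginta.txt > src/freqs.rs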