Overall, still a mixed bag performance-wise: nice speedups (5-11%) on
medium and large sizes, but slowdowns on 100 bytes and below for the
existing methods. However, decode_config_slice is a big win, especially
on smaller sizes: the best case for decoding 3 bytes goes from 166 MiB/s
on master to 222 MiB/s. (Meanwhile the worst case goes from 95 MiB/s to
50 MiB/s, but I really doubt decoding base64 3 bytes at a time is
anyone's hotspot.)
Because this involved splitting out a separate helper function to do
the core of decode_config_buf and decode_config_slice, we seem to get
bitten by some optimiser heuristics that get very fussy about inlining
hints and minor code changes, so that's something to keep an eye on.
The condition for when the fast decode logic can be used was tightened
to allow using it on shorter but still legal inputs.
Make random_config set strip_whitespace based on line wrapping, so
there is no need to separately strip whitespace before decoding.
This opens the door to decoding to a slice, where such writes can't be
papered over by truncating a vec.
This incurs a minor performance cost (normally 0-1%, but up to 8-9% on 3
byte decodes). Perf tuning to follow.
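
A rough sketch of why a slice target is stricter than a Vec, assuming
"such writes" means writes that run past the end of the decoded data.
The helper names and the 6-bytes-in-a-u64 layout are hypothetical and
not taken from the crate:

```rust
/// Number of valid decoded bytes carried in the top 48 bits of a chunk
/// (assumed layout for this sketch).
const DECODED_CHUNK_LEN: usize = 6;

/// Vec target: write all 8 bytes of the chunk, then drop the 2 junk
/// bytes by truncating. The overshoot is harmless because the Vec owns
/// the extra space.
fn push_chunk_vec(out: &mut Vec<u8>, chunk: u64) {
    let start = out.len();
    out.extend_from_slice(&chunk.to_be_bytes());
    out.truncate(start + DECODED_CHUNK_LEN);
}

/// Slice target: there is nothing to truncate, so only the 6 valid
/// bytes may be written. Returns the number of bytes written.
fn write_chunk_slice(out: &mut [u8], chunk: u64) -> usize {
    out[..DECODED_CHUNK_LEN].copy_from_slice(&chunk.to_be_bytes()[..DECODED_CHUNK_LEN]);
    DECODED_CHUNK_LEN
}
```

With the slice version, any bytes after the first 6 in the caller's
buffer are left untouched.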
The old tests that exhaustively check strings a couple bytes long
weren't that useful, and only checked one config. Using the random
config helper in src/tests.rs is a better use of wall clock time
when waiting for tests to run.
Encoded bytes are moved from the end to the front so each byte is
only moved once.
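
One possible reading of the end-to-front move, assuming it refers to
inserting line endings in place into already-encoded bytes. The
function name and the single-byte "\n" ending are assumptions made for
this sketch:

```rust
/// Insert a '\n' after every `line_len` bytes except the last line,
/// in place. Lines are processed from the last to the first, so each
/// byte is copied to its final position exactly once.
fn wrap_in_place(buf: &mut Vec<u8>, line_len: usize) {
    assert!(line_len > 0, "line_len must be non-zero");
    let len = buf.len();
    if len <= line_len {
        return; // a single line needs no line ending
    }
    // No ending is added after the final line.
    let endings = (len - 1) / line_len;
    buf.resize(len + endings, 0);
    for i in (1..=endings).rev() {
        let start = i * line_len;
        let end = usize::min(start + line_len, len);
        // Line i has i line endings before it, so it shifts right by i.
        buf.copy_within(start..end, start + i);
        buf[start + i - 1] = b'\n';
    }
}
```

Working front to back instead would shift the tail of the buffer once
per inserted line ending.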
Encoding is somewhat rearranged to operate on a slice into the
output buffer. This makes it easier to avoid clobbering any
existing bytes in the buffer, as well as paving the way to slice-
based encoding needed for a Display wrapper, stream adapters, etc.
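
A minimal sketch of the slice-into-the-output-buffer idea, not the
crate's actual code. encode_to_slice and encode_append are
hypothetical names, and only the standard alphabet with padding is
handled:

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Length of padded base64 for `n` input bytes.
fn encoded_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}

/// Encode `input` into `output`, which must hold at least
/// `encoded_len(input.len())` bytes. Returns the bytes written.
fn encode_to_slice(input: &[u8], output: &mut [u8]) -> usize {
    let mut o = 0;
    for chunk in input.chunks(3) {
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        output[o] = ALPHABET[(n >> 18) as usize & 63];
        output[o + 1] = ALPHABET[(n >> 12) as usize & 63];
        output[o + 2] = if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] } else { b'=' };
        output[o + 3] = if chunk.len() > 2 { ALPHABET[n as usize & 63] } else { b'=' };
        o += 4;
    }
    o
}

/// Append the encoding of `input` to `buf`. The encoder only ever sees
/// the newly added tail, so existing bytes cannot be clobbered.
fn encode_append(input: &[u8], buf: &mut Vec<u8>) {
    let old_len = buf.len();
    buf.resize(old_len + encoded_len(input.len()), 0);
    let written = encode_to_slice(input, &mut buf[old_len..]);
    debug_assert_eq!(written, buf.len() - old_len);
}
```

The same encode_to_slice could back a Display wrapper or a stream
adapter by handing it a fixed-size scratch slice instead of a Vec
tail.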
This brings big speedups: from over 2x at 100 byte inputs to 6x at
3 KiB and larger. A consequence of this change in logic is that internal
padding characters (a '=' in the middle of base64) are now rejected.
Rejecting them is allowed per RFC 4648 section 3.3, but such
characters were silently ignored before.
On an i7-6850K I'm seeing >3 IPC with a 0.01% branch mispredict rate on
the 10 MiB test. The old code had 1.4 IPC with a pretty hefty 8.46%
branch mispredict rate.
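
To illustrate the internal padding rule mentioned above, here is a
small stand-alone check. It is only a sketch of the rule, not how the
decoder detects it, and find_internal_padding is a hypothetical name:

```rust
/// Returns the index of the first '=' that is followed by any
/// non-'=' byte, or None if padding appears only at the end (or not
/// at all).
fn find_internal_padding(encoded: &[u8]) -> Option<usize> {
    let first_pad = encoded.iter().position(|&b| b == b'=')?;
    if encoded[first_pad..].iter().all(|&b| b == b'=') {
        None
    } else {
        Some(first_pad)
    }
}
```

Under this rule "aGk=" and "Zg==" pass, while "aG=k" is flagged at
index 2.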