This brings big speedups: from over 2x on 100-byte inputs to 6x at 3KiB
and larger. A consequence of this change in logic is that internal
padding characters (a '=' in the middle of base64) are now rejected.
This behavior is allowed per the RFC (4648 s. 3.3), but such
characters were silently ignored before.

On an i7-6850K I'm seeing >3 IPC with a 0.01% branch mispredict rate on
the 10MiB test. The old code had 1.4 IPC with a pretty hefty 8.46%
branch mispredict rate.