The previous XXH3_accumulate_512 loop didn't fare well once XXH128
started swapping the addition.
Neither GCC nor Clang could follow the barely-readable loop, resulting
in garbage code output.
This made XXH3 much slower. Take 32-bit scalar ARM as an example.
Ignoring loads and potential interleaving optimizations, in the main
loop, XXH32 takes 16 cycles for 8 bytes on a typical ARMv6+ CPU, or 2 cpb.
```asm
mla r0, r2, r5, r0 @ 4 cycles: acc1 += lane1 * PRIME32_2
ror r0, r0, #19    @ 1 cycle : acc1  = rotl32(acc1, 13)
mul r0, r0, r6     @ 3 cycles: acc1 *= PRIME32_1
mla r1, r3, r5, r1 @ 4 cycles: acc2 += lane2 * PRIME32_2
ror r1, r1, #19    @ 1 cycle : acc2  = rotl32(acc2, 13)
mul r1, r1, r6     @ 3 cycles: acc2 *= PRIME32_1
```
XXH3_64b takes 9 cycles, or about 1.1 cpb:
```asm
adds  r0, r0, r2     @ 2 cycles: acc_lo += input_lo (sets carry)
adc   r1, r1, r3     @ 1 cycle : acc_hi += input_hi + carry
eor   r4, r4, r2     @ 1 cycle : key_lo ^= input_lo
eor   r5, r5, r3     @ 1 cycle : key_hi ^= input_hi
umlal r0, r1, r4, r5 @ 4 cycles: acc += (u64)key_lo * key_hi
```
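For reference, one lane of the scalar accumulate step compiles down to
roughly that sequence. A simplified C sketch (names, the lane-swap
convention, and endianness handling are approximations, not the exact
xxhash.c code):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of one scalar accumulate lane: read input and secret, XOR them,
 * add the input into the (swapped) accumulator, then fold the low and high
 * halves of the XORed value back in with a 32x32->64 multiply. */
static void accumulate_lane_sketch(uint64_t *acc, const uint8_t *input,
                                   const uint8_t *secret, size_t lane)
{
    uint64_t data_val, key64;
    memcpy(&data_val, input  + 8 * lane, 8);   /* assumes little-endian */
    memcpy(&key64,    secret + 8 * lane, 8);
    uint64_t const data_key = data_val ^ key64;
    acc[lane ^ 1] += data_val;                 /* swapped addition needed by XXH128 */
    acc[lane]     += (uint64_t)(uint32_t)data_key * (data_key >> 32);
}
```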
Benchmarking on a Pixel 2 XL (with a binary built for ARMv4T):
previously, XXH32 got 1.8 GB/s, while XXH3_64b got 1.7 GB/s.
Now, XXH3_64b gets 2.3 GB/s! That lines up well with the cycle
estimates (the additional loads and stores add some overhead).
Unlike before, it is better to disable autovectorization completely, as
the compiler can no longer vectorize the loop well. (Especially with
Clang and NEON, where it extracts lanes to multiply instead of using the
obvious vmlal.u32!)
On that same device in aarch64 mode, the scalar XXH3 version compiled
with `clang-8 -O3 -DXXH_VECTOR=0 -fno-vectorize -fno-slp-vectorize`
went from 2.3 GB/s to 4.3 GB/s. For comparison, the NEON version
gets 6.0 GB/s.
However, almost all platforms with decent autovectorization have a
handwritten intrinsics version which is much faster.
For optimal performance, use -fno-tree-vectorize -fno-tree-slp-vectorize
(or simply disable SIMD instructions entirely).
From testing, ARM32 also prefers forced inlining, so I enabled it.
I also fixed some typos.
Previously, XXH3_64bits looked much faster than XXH3_128bits. In truth,
they perform similarly on long keys. The difference was that
XXH3_64b's benchmark was unseeded, giving it an unfair advantage
over XXH128, which is seeded.
I don't think I am going to do the dummy bench; it made things more
complicated.
Sorry about the disorganized commit. :(
Yet again, I had to fix ARMv6. Clang switched from ldm to ldrd, which
also causes bus errors on unaligned addresses.
Therefore, I decided to fix the root problem and remove the
XXH_FORCE_DIRECT_MEMORY_ACCESS hack, using only memcpy.
This will kill alignment memes for good, and besides, it didn't
seem to make much of a difference.
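With the hack gone, every unaligned read goes through a memcpy-based
helper along these lines (a sketch of the idea; compilers lower it to a
single load wherever the target allows unaligned access):

```c
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from a possibly unaligned pointer.
 * memcpy is the portable, strict-aliasing-safe way to do this; GCC and
 * Clang compile it to a plain load when the target supports it. */
static uint64_t read64_sketch(const void *ptr)
{
    uint64_t val;
    memcpy(&val, ptr, sizeof(val));
    return val;   /* a byte swap would still be needed on big-endian targets */
}
```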
Additionally, I added my better 128-bit long multiply
and applied DRY to XXH3_mul128_fold64. This also removes
the cryptic inline assembly hack.
Each method was documented, too (we need more comments).
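The fold itself boils down to something like this (a sketch assuming
unsigned __int128 is available; the real function also needs 32-bit and
MSVC-intrinsic paths):

```c
#include <stdint.h>

/* Multiply two 64-bit values into a 128-bit product, then XOR-fold the
 * two halves back into 64 bits. Sketch only: assumes the compiler
 * provides unsigned __int128. */
static uint64_t mul128_fold64_sketch(uint64_t lhs, uint64_t rhs)
{
    unsigned __int128 const product = (unsigned __int128)lhs * rhs;
    return (uint64_t)product ^ (uint64_t)(product >> 64);
}
```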
Also, I added a warning for users who are compiling Thumb-1
code for a target supporting ARM instructions.
While all versions of ARM and Thumb-2 meet XXH3's base requirements,
Thumb-1 does not.
First of all, UMULL is inaccessible in the 16-bit subset. This means
that every XXH_mult32to64 becomes a call to __aeabi_lmul.
Since every operation in XXH3 needs to happen in the Lo registers, and
r0-r3 must be set up over and over for the __aeabi_lmul calls, the
output resembles a game of Rush Hour:
```
$ clang -O3 -S --target=arm-none-eabi -march=armv4t -mthumb xxhash.c
$ grep -c mov xxhash.s
5472
$ clang -O3 -S --target=arm-none-eabi -march=armv4t xxhash.c
$ grep -c mov xxhash.s
2071
```
It is much more practical to compile xxHash for the wider instruction
sets, where these restrictions do not apply.
The warning is not emitted when targeting ARMv6-M, where Thumb-1 is
unavoidable.
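The check behind the warning is roughly this shape (a sketch using
standard ACLE/GCC macros; the exact condition in xxhash.h may differ):

```c
/* Warn when compiling Thumb-1 (Thumb without Thumb-2) for a target that
 * also has the 32-bit ARM instruction set available. ARMv6-M targets do
 * not define __ARM_ARCH_ISA_ARM, so they stay silent. */
#if defined(__thumb__) && !defined(__thumb2__) && defined(__ARM_ARCH_ISA_ARM)
#  warning "XXH3 is inefficient in Thumb-1; compile in ARM or Thumb-2 mode."
#endif
```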
Lastly, I removed the `#pragma clang loop` hack, which no longer worked
since the number of iterations can't be constant-evaluated. Now we
don't get 20 warnings when compiling for x86.
Clang was using ldmia and ldrd on unaligned pointers. These
instructions don't support unaligned access.
I also check the numerical value of __ARM_ARCH.
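For illustration, checking the numeric macro looks roughly like this (a
sketch; the exact version threshold here is an assumption):

```c
/* Prefer the single numeric __ARM_ARCH macro (ARMv6 -> 6, ARMv7 -> 7, ...)
 * over enumerating __ARM_ARCH_6__, __ARM_ARCH_7A__, etc. individually. */
#if defined(__ARM_ARCH) && __ARM_ARCH >= 6
    /* ARMv6+ specific code path */
#endif
```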