- Loads are ugly. I haven't found any good documentation about
unaligned loads (see the sketch below).
- Hopefully reduce the conditionals.
I mostly want to test on Travis, as I don't have an s390x toolchain
at the moment.
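For what it's worth, a common portable route for unaligned loads is to
go through memcpy, which compilers lower to a single load when the
target allows unaligned access. A sketch only, not necessarily what the
s390x path should do; the function name is illustrative:
```c
#include <stdint.h>
#include <string.h>

/* Portable unaligned 32-bit load: memcpy avoids the undefined behavior
 * of a misaligned pointer dereference, and is typically compiled down
 * to one load instruction on targets that permit unaligned access. */
static uint32_t read32_unaligned(const void* ptr)
{
    uint32_t val;
    memcpy(&val, ptr, sizeof(val));
    return val;
}
```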
Some tests do not work in C90 strict mode, due to the (incorrect)
presence of the `inline` keyword in some standard libraries' header
files.
The previous method was to disable the `inline` keyword, but this
introduces other problems with more complex multi-file projects,
such as benchHash, which was recently added as part of `make test`.
Added a new environment variable to disable the C90 compatibility
test: `NO_C90_TEST=true`.
Note: apparently, Appveyor doesn't like comments inside () sub-blocks :(
When inlining, `xxhash.h` is now included instead of `xxhash.c`.
This seems preferable for some build systems,
which don't like the `#include "xxhash.c"` statement
when inlining xxHash, as reported by @pdillinger.
Note that `xxhash.c` still exists;
it just includes the implementation and instantiates it.
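With this change, inlining requires only the header. A minimal usage
sketch (`XXH_INLINE_ALL` is the existing macro):
```c
/* Inline the full xxHash implementation into this translation unit:
 * define XXH_INLINE_ALL before including the header. xxhash.c no
 * longer needs to be compiled or included. */
#define XXH_INLINE_ALL
#include "xxhash.h"
```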
Fixes #258.
Internal types have been renamed:
```c
BYTE -> xxh_u8
U32 -> xxh_u32
U64 -> xxh_u64
```
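Presumably these map to thin aliases over the public fixed-width types,
along these lines (a sketch; the exact definitions live in the sources):
```c
#include "xxhash.h"  /* XXH32_hash_t, XXH64_hash_t */

/* Sketch: internal names as aliases of the public hash types. */
typedef unsigned char xxh_u8;   /* was BYTE */
typedef XXH32_hash_t  xxh_u32;  /* was U32  */
typedef XXH64_hash_t  xxh_u64;  /* was U64  */
```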
Additionally, I hopefully fixed an issue for targets where int is 16
bits. XXH32 used unsigned int for its seed and, in C90 mode, unsigned
int as its U32. This would cause truncation issues. In C90 mode, I now
check limits.h to make sure UINT_MAX == 0xFFFFFFFFUL, and if it isn't,
use unsigned long instead.
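A minimal sketch of that check, assuming C90 mode (when C99's
<stdint.h> is available, uint32_t would be used instead):
```c
#include <limits.h>

/* C90 has no <stdint.h>, so verify that unsigned int really has
 * 32 bits; on 16-bit int targets (e.g. AVR), fall back to
 * unsigned long, which C90 guarantees is at least 32 bits wide. */
#if UINT_MAX == 0xFFFFFFFFUL
  typedef unsigned int  XXH32_hash_t;
#else
  typedef unsigned long XXH32_hash_t;
#endif
```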
We should see if we can set up an AVR CI test. Just to run the
verification program, though, as the benchmark will take a very long
time.
Lastly, the seed types for XXH32/XXH64 are now XXH32_hash_t and
XXH64_hash_t, respectively. This matches xxhash.c and prevents the
aforementioned 16-bit int bug.
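The public prototypes thus take and return the hash types (simplified;
the real declarations in xxhash.h also carry the XXH_PUBLIC_API
qualifier):
```c
#include <stddef.h>

/* The seed now has the same width as the hash on both variants. */
XXH32_hash_t XXH32(const void* input, size_t length, XXH32_hash_t seed);
XXH64_hash_t XXH64(const void* input, size_t length, XXH64_hash_t seed);
```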
The previous XXH3_accumulate_512 loop didn't fare well after XXH128
started swapping the addition. Neither GCC nor Clang could follow the
barely-readable loop, resulting in garbage code output. This made XXH3
much slower. Take 32-bit scalar ARM as an example.
Ignoring loads and potential interleaving optimizations, in the main
loop, XXH32 takes 16 cycles for 8 bytes on a typical ARMv6+ CPU, or 2 cpb.
```asm
mla r0, r2, r5, r0 @ 4 cycles
ror r0, r0, #19 @ 1 cycle
mul r0, r0, r6 @ 3 cycles
mla r1, r3, r5, r1 @ 4 cycles
ror r1, r1, #19 @ 1 cycle
mul r1, r1, r6 @ 3 cycles
```
XXH3_64b takes 9 cycles for the same 8 bytes, or ~1.1 cpb:
```asm
adds r0, r0, r2 @ 2 cycles
adc r1, r1, r3 @ 1 cycle
eor r4, r4, r2 @ 1 cycle
eor r5, r5, r3 @ 1 cycle
umlal r0, r1, r4, r5 @ 4 cycles
```
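For reference, here is a C sketch of the per-lane step the XXH3
assembly above implements (my reading of the asm; the real
XXH3_accumulate_512 also swaps which lane receives the input addition,
omitted here for clarity):
```c
#include <stdint.h>

/* One scalar lane: a 64-bit add of the input word (adds + adc above),
 * then a 32x32->64 multiply-accumulate of the key-mixed halves
 * (eor + eor + umlal above). */
static uint64_t accumulate_lane(uint64_t acc, uint64_t data, uint64_t key)
{
    uint64_t mixed = data ^ key;
    acc += data;
    acc += (mixed & 0xFFFFFFFFULL) * (mixed >> 32);
    return acc;
}
```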
Benchmarking on a Pixel 2 XL (with a binary targeting ARMv4T):
previously, XXH32 got 1.8 GB/s, while XXH3_64b got 1.7 GB/s. Now,
XXH3_64b gets 2.3 GB/s! This calculates out well (the additional loads
and stores have some overhead).
Unlike before, it is now better to disable autovectorization
completely, as the compiler can't vectorize this loop well (especially
Clang with NEON, where it extracts to multiply instead of using the
obvious vmlal.u32!).
On that same device in aarch64 mode, compiled with
`clang-8 -O3 -DXXH_VECTOR=0 -fno-vectorize -fno-slp-vectorize`,
XXH3's scalar version went from 2.3 GB/s to 4.3 GB/s. For comparison,
the NEON version gets 6.0 GB/s.
However, almost all platforms with decent autovectorization have a
handwritten intrinsics version which is much faster.
For optimal performance, use `-fno-tree-vectorize -fno-tree-slp-vectorize`
(or simply disable SIMD instructions entirely).
From testing, ARM32 also prefers forced inlining, so I enabled it.
I also fixed some typos.
Previously, XXH3_64bits looked much faster than XXH3_128bits. The
truth is that they perform similarly on long keys. The difference was
that XXH3_64b's benchmark was unseeded, putting it at an unfair
advantage over XXH128, which is seeded.
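For a fair comparison, both variants should go through their seeded
entry points. A minimal sketch using the library's
XXH3_64bits_withSeed / XXH3_128bits_withSeed (data, len, and seed are
illustrative):
```c
#include "xxhash.h"

/* Benchmark both variants with the same seed so neither gets an
 * unfair unseeded advantage. */
static void bench_pair(const void* data, size_t len, XXH64_hash_t seed)
{
    XXH64_hash_t  h64  = XXH3_64bits_withSeed(data, len, seed);
    XXH128_hash_t h128 = XXH3_128bits_withSeed(data, len, seed);
    (void)h64; (void)h128;
}
```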
I don't think I am going to do the dummy bench. That made things more
complicated.