Sorry about the disorganized commit. :(
Yet again, I had to fix ARMv6. Clang went from emitting ldm to ldrd,
which also bus errors on unaligned addresses.
Therefore, I decided to fix the root problem and remove the
XXH_FORCE_DIRECT_MEMORY_ACCESS hack, using only memcpy.
This will kill alignment memes for good, and besides, the hack didn't
seem to make much of a difference in performance.
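
For reference, a minimal sketch of the memcpy-only read style described
above. The helper names are made up for illustration and are not the
library's own; the idea is only that a fixed-size memcpy lets the compiler
pick a load that is safe for the target instead of assuming alignment.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 32 bits in native byte order from an address
 * that may be unaligned. No pointer cast, so no alignment assumption. */
static uint32_t read32_unaligned(const void *ptr)
{
    uint32_t value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

/* Hypothetical helper: same idea for 64 bits. */
static uint64_t read64_unaligned(const void *ptr)
{
    uint64_t value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}
```

Reading from `buf + 1` with these helpers is well defined, whereas
dereferencing `(const uint32_t *)(buf + 1)` is what invites ldm/ldrd.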
Additionally, I added my better 128-bit long multiply
and applied DRY to XXH3_mul128_fold64. This also removes
the cryptic inline assembly hack.
Each method was documented, too (we need more comments).
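
A rough sketch of a portable 64x64 -> 128 multiply followed by a fold to
64 bits, built only from 32x32 -> 64 multiplies. The function names are
hypothetical, and folding with XOR is an assumption based on how current
xxHash releases behave; the code in this commit may differ.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32x32 -> 64 multiply. */
static uint64_t mult32to64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Multiply two 64-bit values into a 128-bit product using four partial
 * products, then fold the halves together (XOR assumed here). */
static uint64_t mul128_fold64(uint64_t lhs, uint64_t rhs)
{
    uint64_t lo_lo = mult32to64((uint32_t)lhs, (uint32_t)rhs);
    uint64_t hi_lo = mult32to64((uint32_t)(lhs >> 32), (uint32_t)rhs);
    uint64_t lo_hi = mult32to64((uint32_t)lhs, (uint32_t)(rhs >> 32));
    uint64_t hi_hi = mult32to64((uint32_t)(lhs >> 32), (uint32_t)(rhs >> 32));

    /* This sum cannot overflow: its maximum is exactly 2^64 - 1. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFu) + lo_hi;
    uint64_t upper = (hi_lo >> 32) + (cross >> 32) + hi_hi;
    uint64_t lower = (cross << 32) | (lo_lo & 0xFFFFFFFFu);

    return upper ^ lower;
}
```

On targets with a native 128-bit type the same result can come from a
single widening multiply; the version above is the fallback shape.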
Also, I added a warning for users who are compiling Thumb-1
code for a target supporting ARM instructions.
While all versions of ARM and Thumb-2 meet XXH3's base requirements,
Thumb-1 does not.
First of all, UMULL is inaccessible in the 16-bit subset. This means
that every XXH_mult32to64 means a call to __aeabi_lmul.
Since every operation in XXH3 needs to happen in the Lo registers,
and r0-r3 have to be set up many times for __aeabi_lmul, the output
resembles a game of Rush Hour:
$ clang -O3 -S --target=arm-none-eabi -march=armv4t -mthumb xxhash.c
$ grep -c mov xxhash.s
5472
$ clang -O3 -S --target=arm-none-eabi -march=armv4t xxhash.c
$ grep -c mov xxhash.s
2071
It is much more practical to compile xxHash with the wider instruction
sets, as these restrictions do not apply.
This doesn't warn if ARMv6-M is targeted; Thumb-1 is unavoidable.
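
A sketch of what such a check can look like, assuming the usual
compiler-defined macros: __thumb__ and __thumb2__ from GCC/Clang, and
__ARM_ARCH_ISA_ARM from ACLE, which is only defined when the target can
run ARM instructions (so not on ARMv6-M). The HASH_THUMB1_PENALTY macro
and the warning text are invented for this example.

```c
/* Thumb-1 code (no Thumb-2) on a target that could run ARM code instead. */
#if defined(__thumb__) && !defined(__thumb2__) && defined(__ARM_ARCH_ISA_ARM)
#  warning "Compiling as Thumb-1 is slow here; consider building as ARM or Thumb-2."
#  define HASH_THUMB1_PENALTY 1   /* hypothetical flag, for illustration */
#else
#  define HASH_THUMB1_PENALTY 0
#endif
```

With -march=armv4t -mthumb the first branch is taken; with -march=armv4t
alone, or on ARMv6-M, it is not.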
Lastly, I removed the pragma clang loop hack, which didn't work anymore
since the number of iterations can't be constant-evaluated. Now, we
don't have 20 warnings when compiling for x86.
The VSX codepath is now working on POWER8 and is fully enabled.
The little endian code has been verified on POWER8E; no big endian
machine was available for testing.
This uses vpermxor from POWER8 to shuffle on big endian.
There are a few other fixes as well to unify endian memes.
- seems to produce the same results as the non-streaming functions
- the 128-bit non-streaming ones don't support "custom secret", so
neither does the streaming variant
- the 64-bit functions seem to do something more clever in order to
avoid leaking the key/secret, which none of the current 128-bit
functions do
- the naming of the streaming functions is a bit weird now, since
most of the ones with "64" in the name are what should be used in
the 128-bit case too
I was testing the non-streaming XXH3 on a 512-byte block against a streaming API that happened to do a "3 bytes update, then 509 bytes update", and the results were different. It looks like `bufferedSize` was not properly updated if the incoming block size landed exactly on a multiple of the internal buffer size. With this proposed fix the results match.
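
To illustrate the kind of check that catches this, here is a toy
streaming hasher (FNV-1a over an 8-byte internal buffer) with a
split-consistency test. Everything here is hypothetical and much simpler
than XXH3's real buffering; it only shows the bookkeeping of
`bufferedSize` when an update ends exactly on a buffer boundary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK 8 /* toy internal buffer size, not XXH3's */

typedef struct {
    uint64_t acc;
    unsigned char buffer[TOY_BLOCK];
    size_t bufferedSize;
} toy_state;

static void toy_consume(uint64_t *acc, const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        *acc = (*acc ^ p[i]) * 0x100000001B3ULL; /* FNV-1a step */
}

static void toy_reset(toy_state *s)
{
    s->acc = 0xCBF29CE484222325ULL;
    s->bufferedSize = 0;
}

static void toy_update(toy_state *s, const void *data, size_t len)
{
    const unsigned char *p = data;
    if (s->bufferedSize > 0) {
        size_t fill = TOY_BLOCK - s->bufferedSize;
        if (len < fill) { /* still not a full block: just buffer it */
            memcpy(s->buffer + s->bufferedSize, p, len);
            s->bufferedSize += len;
            return;
        }
        memcpy(s->buffer + s->bufferedSize, p, fill);
        toy_consume(&s->acc, s->buffer, TOY_BLOCK);
        s->bufferedSize = 0; /* must be reset even if the input ends here */
        p += fill;
        len -= fill;
    }
    while (len >= TOY_BLOCK) {
        toy_consume(&s->acc, p, TOY_BLOCK);
        p += TOY_BLOCK;
        len -= TOY_BLOCK;
    }
    memcpy(s->buffer, p, len);
    s->bufferedSize = len; /* updated even when len == 0 */
}

static uint64_t toy_digest(const toy_state *s)
{
    uint64_t acc = s->acc;
    toy_consume(&acc, s->buffer, s->bufferedSize);
    return acc;
}

/* Returns 1 if hashing in two updates (first, len - first) matches one-shot. */
static int split_matches(const unsigned char *data, size_t len, size_t first)
{
    toy_state one, two;
    toy_reset(&one);
    toy_update(&one, data, len);
    toy_reset(&two);
    toy_update(&two, data, first);
    toy_update(&two, data + first, len - first);
    return toy_digest(&one) == toy_digest(&two);
}
```

Running `split_matches` over every split point of a 512-byte input,
including 3 + 509, exercises the exact-multiple case described above.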
recommended by @aras-p.
I could not reproduce the issue,
but removing the unroll statement doesn't hurt wasm anyway,
so let's remove it for emscripten.
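
A sketch of how an unroll hint can be compiled out for emscripten. The
HASH_UNROLL macro and the loop are invented for illustration, and the
exact pragma used in the library is an assumption; __EMSCRIPTEN__ is the
macro emcc defines.

```c
#include <stddef.h>
#include <stdint.h>

/* Only ask clang to unroll when not targeting emscripten/wasm. */
#if defined(__clang__) && !defined(__EMSCRIPTEN__)
#  define HASH_UNROLL _Pragma("clang loop unroll(enable)")
#else
#  define HASH_UNROLL /* no hint */
#endif

/* Hypothetical loop standing in for an accumulation loop. */
static uint64_t sum_lanes(const uint64_t *lanes, size_t n)
{
    uint64_t total = 0;
    HASH_UNROLL
    for (size_t i = 0; i < n; i++)
        total += lanes[i];
    return total;
}
```

The hint only affects code generation; the result is the same either way.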