We know it exists; don't hide it.
It is highly unlikely to occur with proper seeding and random inputs,
and it doesn't occur on the 128-bit version, so make sure people are
aware of it.
Comments are now synchronized across all SIMD implementations, and each
now has a summary block comment.
Additionally, VSX now uses xxh_u64x2 to match the scalar typedefs.
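For reference, a minimal sketch of the naming this refers to (the exact
definition in xxhash.h may differ):

```c
/* Sketch: a VSX vector typedef named in the same style as the scalar
 * xxh_u32/xxh_u64 typedefs. Requires VSX (-mvsx). */
#include <altivec.h>
typedef __vector unsigned long long xxh_u64x2;
```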
It's not useful to swap input segments:
the differentiation is already taken care of by the seed itself,
and keeping the number in the low bits slightly improves dispersion.
This may also improve speed for the specific case len=8 (constant).
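Purely as an illustration of that idea (this is not the actual XXH3 code,
and all names below are hypothetical), combining the two 4-byte reads in
their natural order and letting the seed provide the differentiation looks
roughly like this:

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

typedef uint32_t xxh_u32;
typedef uint64_t xxh_u64;

/* hypothetical helper: unaligned 32-bit read (assumes a little-endian host) */
static xxh_u32 readLE32(const void* ptr)
{
    xxh_u32 v;
    memcpy(&v, ptr, sizeof v);
    return v;
}

/* Illustrative only: the first and last 4 bytes are combined without
 * swapping segments, the value stays in the low bits, and the seed alone
 * differentiates the result in the keying step. */
static xxh_u64 mix_4to8(const void* input, size_t len, xxh_u64 seed, xxh_u64 secret64)
{
    xxh_u32 const hi = readLE32(input);
    xxh_u32 const lo = readLE32((const char*)input + len - 4);
    xxh_u64 const combined = ((xxh_u64)hi << 32) | lo;
    return combined ^ (secret64 - seed);
}

int main(void)
{
    const char msg[8] = { 'a','b','c','d','e','f','g','h' };
    printf("%016llx\n",
           (unsigned long long)mix_4to8(msg, sizeof msg, 42, 0x9E3779B185EBCA87ULL));
    return 0;
}
```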
With -O3, GCC goes overboard on unrolling the AVX2 code path, generating
slower code than MSVC and Clang.
We can override that with a pragma that forces GCC to use -O2 instead.
Note that GCC still generates the best scalar and SSE2 code with -O3.
I also noted that GCC will split _mm256_loadu_si256 into two instructions
when targeting generic+avx2 (an optimization that only benefits the
non-AVX2 Sandy Bridge and Ivy Bridge chips), and provided the recommended
compiler flags.
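A minimal sketch of the pragma approach, assuming GCC's
push_options/optimize pragmas (the guard conditions here are illustrative;
the real ones may be more involved):

```c
/* Sketch: cap GCC at -O2 around the AVX2 code path only, then restore the
 * user's optimization settings afterwards. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__AVX2__) \
    && defined(__OPTIMIZE__) && !defined(__OPTIMIZE_SIZE__)
#  pragma GCC push_options
#  pragma GCC optimize("-O2")
#endif

/* ... AVX2 implementation ... */

#if defined(__GNUC__) && !defined(__clang__) && defined(__AVX2__) \
    && defined(__OPTIMIZE__) && !defined(__OPTIMIZE_SIZE__)
#  pragma GCC pop_options
#endif
```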
XXH_FORCE_MEMORY_ACCESS==3 will use a byteshift operation. This is
preferred on older compilers which don't inline `memcpy()`, or on some
big-endian systems without a native byteswap.
Also fix a small typo.
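A sketch of what the byteshift access looks like for a 32-bit
little-endian read (the real code also covers the 64-bit and big-endian
variants):

```c
/* Sketch: XXH_FORCE_MEMORY_ACCESS==3 style read, built purely from byte
 * loads and shifts, so no unaligned access, memcpy(), or byteswap
 * intrinsic is required. */
#include <stdint.h>

typedef uint8_t  xxh_u8;
typedef uint32_t xxh_u32;

static xxh_u32 XXH_readLE32(const void* memPtr)
{
    const xxh_u8* bytePtr = (const xxh_u8*)memPtr;
    return bytePtr[0]
         | ((xxh_u32)bytePtr[1] << 8)
         | ((xxh_u32)bytePtr[2] << 16)
         | ((xxh_u32)bytePtr[3] << 24);
}
```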
Old/stupid compilers may emit the 64-bit mask literally in XXH_mult32to64
and still perform a full 64x64 multiply, e.g. ARM GCC 2.95:
```c
xxh_u64 XXH_mult32to64(xxh_u64 a, xxh_u64 b)
{
    return (a & 0xffffffff) * (b & 0xffffffff);
}
```
`arm-gcc-2.95 -O3 -S -march=armv4t -mcpu=arm7tdmi -fomit-frame-pointer`
```asm
XXH_mult32to64:
        push    {r4, r5, r6, r7, lr}
        mov     r5, #0
        mov     r4, #0xffffffff
        mov     r7, r5
        mov     r6, r4
        @ mask 32-bit registers by 0x00000000 and 0xffffffff ?!?!?!
        and     r6, r6, r0
        and     r7, r7, r1
        and     r4, r4, r2
        and     r5, r5, r3
        @ full 64x64->64 multiply
        umull   r0, r1, r6, r4
        mla     r1, r6, r5, r1
        mla     r1, r4, r7, r1
        pop     {r4, r5, r6, r7, pc}
```
Meanwhile, using a downcast followed by an upcast generates the expected
code, albeit with some understandable regalloc weirdness (ARM support
was only recently added).
```c
xxh_u64 XXH_mult32to64(xxh_u64 a, xxh_u64 b)
{
    return (xxh_u64)(xxh_u32)a * (xxh_u64)(xxh_u32)b;
}
```
`arm-gcc-2.95 -O3 -S -march=armv4t -mcpu=arm7tdmi -fomit-frame-pointer`
```asm
XXH_mult32to64:
        push    {r4, lr}
        umull   r3, r4, r0, r2
        mov     r1, r4
        mov     r0, r3
        pop     {r4, pc}
```
Switching to this implementation may also remove the requirement for
`__emulu` on MSVC x86, but it hasn't been tested yet.
All modern compilers should recognize both patterns, but it seems that
old 32-bit compilers will prefer the latter, making this a free
optimization.
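For context, a sketch of how the dispatch could be arranged if the cast
form turns out to be sufficient everywhere (untested, as noted above; the
exact macro in xxhash.h may differ):

```c
#include <stdint.h>
typedef uint32_t xxh_u32;
typedef uint64_t xxh_u64;

#if defined(_MSC_VER) && defined(_M_IX86)
#  include <intrin.h>   /* __emulu */
#  define XXH_mult32to64(x, y) __emulu((unsigned)(x), (unsigned)(y))
#else
   /* the cast form discussed above; recognized as a 32x32->64 multiply */
#  define XXH_mult32to64(x, y) ((xxh_u64)(xxh_u32)(x) * (xxh_u64)(xxh_u32)(y))
#endif
```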
This new test doesn't use any Unicode in the source files, instead
encoding all UTF-8 and UTF-16 as hex.
The test scripts are generated by a C program, which writes out both a
shell script and a batch script, as well as the Unicode file to test.
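For illustration only (the actual filename and generator differ), hex
escapes keep the C source pure ASCII while still producing a Unicode
filename:

```c
/* Sketch with a hypothetical filename: the UTF-8 and UTF-16 spellings are
 * embedded as hex escapes, so this source file contains no Unicode. */
#include <stdio.h>

static const char utf8_name[] = "\xC3\xA9" "xxhash.txt";   /* U+00E9 + "xxhash.txt", UTF-8 */
static const unsigned short utf16_name[] = {               /* same name, UTF-16 */
    0x00E9, 'x', 'x', 'h', 'a', 's', 'h', '.', 't', 'x', 't', 0
};

int main(void)
{
    /* the real generator also writes out the shell and batch scripts */
    FILE* const f = fopen(utf8_name, "wb");
    if (f == NULL) return 1;
    fputs("test\n", f);
    fclose(f);
    (void)utf16_name;
    return 0;
}
```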
On Cygwin, MinGW, and MSYS, the shell script will automatically hand off
to the batch script, as cmd.exe has more reliable Unicode support, at
least on Windows 7 and later.
When the make rule is called, it first checks whether `$LANG` contains
UTF-8, which defines the (overridable) ENABLE_UNICODE flag. If it does
not, the test is skipped with a warning.
Also fixed an issue with printf in multiInclude.c causing warnings on
old MinGW versions which expect %I64, and updated the .gitignore.
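A hedged sketch of the kind of fix involved (the actual change in
multiInclude.c may differ): the <inttypes.h> macros expand to whatever
format the local C runtime expects, e.g. "I64u" on old MinGW/MSVCRT and
"llu" elsewhere, which avoids the -Wformat warning.

```c
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t const hash = 0x9E3779B185EBCA87ULL;
    /* PRIu64 picks the runtime-appropriate 64-bit format specifier */
    printf("%" PRIu64 "\n", hash);
    return 0;
}
```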