Commit Graph

700 Commits

Author SHA1 Message Date
Yann Collet
b604c7bee5
Merge pull request #269 from easyaspi314/endianness_fix
Fix endianness detection on GCC, avoid XXH_cpuIsLittleEndian.
2019-10-04 13:38:51 -07:00
easyaspi314 (Devin)
028c0fd534 Fix endianness detection on GCC, avoid XXH_cpuIsLittleEndian. 2019-10-04 09:47:09 -04:00
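For context, a minimal sketch of what compile-time endianness detection on GCC/Clang can look like, using the predefined `__BYTE_ORDER__` macro with a runtime union fallback. This is illustrative only, not the exact code from this commit:

```c
#include <stdint.h>

/* Compile-time detection via GCC/Clang predefined macros,
 * with a runtime fallback for other compilers. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
#  define IS_LITTLE_ENDIAN (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#else
static int is_little_endian(void)
{
    /* Inspect the first byte of a known 32-bit value. */
    const union { uint32_t u; uint8_t c[4]; } one = { 1 };
    return one.c[0];
}
#  define IS_LITTLE_ENDIAN is_little_endian()
#endif
```

When the macro path is taken, the branch folds away at compile time, which is the point of avoiding a runtime check like XXH_cpuIsLittleEndian.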
Yann Collet
71a8150b6f
Merge pull request #267 from easyaspi314/main_loop_cleanup
[SCALAR] Improve scalar XXH3_accumulate_512 loop
2019-10-03 10:38:43 -07:00
easyaspi314 (Devin)
9b6fa1067f [SCALAR] Improve scalar XXH3_accumulate_512 loop
The previous XXH3_accumulate_512 loop didn't fare well since XXH128
started swapping the addition.

Neither GCC nor Clang could follow the barely-readable loop, resulting
in garbage code output.

This made XXH3 much slower. Take 32-bit scalar ARM as an example.

Ignoring loads and potential interleaving optimizations, in the main
loop, XXH32 takes 16 cycles for 8 bytes on a typical ARMv6+ CPU, or 2 cpb.

```asm
mla     r0, r2, r5, r0  @ 4 cycles
ror     r0, r0, #19     @ 1 cycle
mul     r0, r0, r6      @ 3 cycles
mla     r1, r3, r5, r1  @ 4 cycles
ror     r1, r1, #19     @ 1 cycle
mul     r1, r1, r6      @ 3 cycles
```

XXH3_64b takes 9, or 1.1 cpb:
```asm
adds    r0, r0, r2      @ 2 cycles
adc     r1, r1, r3      @ 1 cycle
eor     r4, r4, r2      @ 1 cycle
eor     r5, r5, r3      @ 1 cycle
umlal   r0, r1, r4, r5  @ 4 cycles
```
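
The five instructions above map onto one scalar lane round. A hedged C sketch of that round, with illustrative names rather than xxHash's exact internals:

```c
#include <stdint.h>

/* One scalar lane round: a 64-bit accumulate (adds/adc), a 64-bit
 * XOR with the key (eor/eor), and a 32x32->64 multiply accumulated
 * into the lane (umlal). acc[1] receives the swapped addition
 * mentioned above. */
static void scalar_round(uint64_t acc[2], uint64_t data_val, uint64_t key64)
{
    uint64_t data_key = data_val ^ key64;
    acc[1] += data_val;
    acc[0] += (uint64_t)(uint32_t)data_key * (data_key >> 32);
}
```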

Benchmarking on a Pixel 2 XL (with a binary for ARMv4T), previously,
XXH32 got 1.8 GB/s, while XXH3_64b got 1.7.

Now, XXH3_64b gets 2.3 GB/s! This calculates out well (as additional
loads and stores have some overhead).

Unlike before, it is better to disable autovectorization completely, as
the compiler can't vectorize the new loop as well. (This is especially
true with Clang and NEON, where it extracts to multiply instead of
using the obvious vmlal.u32!)

On that same device in aarch64 mode, when compiled with
`clang-8 -O3 -DXXH_VECTOR=0 -fno-vectorize -fno-slp-vectorize`,
XXH3's scalar version went from 2.3 GB/s to 4.3 GB/s. For comparison,
the NEON version gets 6.0 GB/s.

However, almost all platforms with decent autovectorization have a
handwritten intrinsics version which is much faster.

For optimal performance, use -fno-tree-vectorize -fno-tree-slp-vectorize
(or simply disable SIMD instructions entirely).

From testing, ARM32 also prefers forced inlining, so I enabled it.

I also fixed some typos.
2019-10-03 09:56:18 -04:00
Yann Collet
96e8472380 documented open API consistency questions 2019-10-02 14:47:59 -07:00
Yann Collet
28950be40c updated code comments
especially on the canonical representation paragraph,
to make it clear it's the preferred format for storage and transmission.
2019-10-02 14:31:14 -07:00
Yann Collet
2b956f86b0
Merge pull request #266 from easyaspi314/varnames
Try to improve some variable names.
2019-10-02 11:28:45 -07:00
Yann Collet
b2154f3583
Merge pull request #265 from easyaspi314/fair_bench
Use both seeded and unseeded variants in the bench
2019-10-02 11:01:31 -07:00
easyaspi314 (Devin)
1367385768 Fix mixed declaration
I need to stop coding before my coffee. :/
2019-10-02 13:01:51 -04:00
easyaspi314 (Devin)
425dbd8d86 Try to improve some variable names.
It's only a start, but it is an improvement. I still have more things
I would like to change, but it is good for now.
2019-10-02 12:28:01 -04:00
easyaspi314 (Devin)
91d6e4927e Use both seeded and unseeded variants in the bench
Previously, XXH3_64bits looked much faster than XXH3_128bits. The truth
is that they are similar for long keys. The difference was that
XXH3_64b's benchmark was unseeded, putting it at an unfair advantage
over XXH128, which is seeded.

I don't think I am going to do the dummy bench. That made things more
complicated.
2019-10-01 23:23:55 -04:00
Yann Collet
3df9e91856
Merge pull request #264 from easyaspi314/voidptrfix
Reduce void pointers and evil casts.
2019-10-01 20:07:53 -07:00
easyaspi314 (Devin)
cb4adfcc10 Typo 2019-10-01 19:00:28 -04:00
easyaspi314 (Devin)
f90b0aba40 Reduce void pointers and evil casts. 2019-10-01 18:52:21 -04:00
Yann Collet
a44629ace1
Merge pull request #262 from Cyan4973/xxh128_17p
improve xxh128 for mid-size
2019-09-30 23:11:01 -07:00
Yann Collet
c8f3fb514c factorized mix32B
changing xxh128 results for len within 129-240.
2019-09-30 22:36:07 -07:00
Yann Collet
9d79fd7bc1 factor mix32 2019-09-30 17:55:46 -07:00
Yann Collet
43b5c76b4c fixed mistake in last ingested segment 2019-09-30 17:33:38 -07:00
Yann Collet
0bed0c2e5b updated self-test values for xxh128 2019-09-30 17:26:04 -07:00
Yann Collet
6896c5798f fix input distribution over 128-bit state
for mid-size length 17+
2019-09-30 17:13:59 -07:00
Yann Collet
cd0f5c2209 slightly updated xxh128 at len 1-3
for a slightly better bias
2019-09-28 20:02:55 -07:00
Yann Collet
ea5c659701 update man page 2019-09-28 17:55:41 -07:00
Yann Collet
eab46160a9 update examples and comment 2019-09-28 17:39:00 -07:00
Yann Collet
384776e4ac
Merge pull request #260 from Cyan4973/xxh128sum
XXH128
2019-09-28 17:23:55 -07:00
Yann Collet
549fca1204 added capability to control XXH128 hashes
added xxh128sum link
2019-09-28 16:49:11 -07:00
Yann Collet
d5336efe31 fixed extraneous ' ' character
failing `-c` verification test
2019-09-28 14:58:07 -07:00
Yann Collet
ce7dbf03e0 improved programming pattern for hashStream 2019-09-28 14:27:32 -07:00
Yann Collet
f2be00e938 update valgrind test 2019-09-27 19:50:40 -07:00
Yann Collet
e098fffe0a fix #259
fix collisions for xxh128 in 9-16 bytes range
2019-09-27 17:55:33 -07:00
Yann Collet
3649220147 added tests for xxh128sum 2019-09-27 17:50:02 -07:00
Yann Collet
af010ba987 added xxh128sum
== xxhsum -H2
2019-09-27 17:40:36 -07:00
Yann Collet
9538a9d80b
Merge pull request #256 from Cyan4973/Loading
xxhsum -q no longer displays "Loading" notification
2019-09-17 20:51:29 -07:00
Yann Collet
ed35bc47a8
Merge pull request #255 from Cyan4973/license2
updated LICENSE
2019-09-17 18:09:30 -07:00
Yann Collet
d8551d294d xxhsum -q does not display "Loading" notification
fix #251
2019-09-17 18:08:32 -07:00
Yann Collet
330444389b updated LICENSE
to reflect the different terms
for the library (BSD-2)
and the command line interface (GPLv2),

answering #253
2019-09-17 17:14:15 -07:00
Yann Collet
1ce04e37a1
Merge pull request #254 from easyaspi314/multalign
Better 128-bit multiply, multiple bugfixes.
2019-09-16 21:48:56 -07:00
easyaspi314 (Devin)
e923cc63e0 Disable DIRECT_MEMORY_ACCESS check for Clang.
Clang prefers to emit aligned-only instructions with the second variant.

Clang works fine with memcpy.
2019-09-16 23:16:00 -04:00
easyaspi314 (Devin)
a1da6e28b0 Revert XXH_FORCE_DIRECT_MEMORY_ACCESS but exclude clang. 2019-09-16 19:07:20 -04:00
easyaspi314 (Devin)
6a768abdba Remove extra blank line 2019-09-16 10:12:04 -04:00
easyaspi314 (Devin)
1a5663552b Fix typo 2019-09-16 10:10:46 -04:00
easyaspi314 (Devin)
c94e68d705 Better 128-bit multiply, multiple bugfixes.
Sorry about the disorganized commit. :(

Yet again, I had to fix ARMv6. Clang went from emitting ldm to
emitting ldrd, which also causes bus errors.

Therefore, I decided to fix the root problem and remove the
XXH_FORCE_DIRECT_MEMORY_ACCESS hack, using only memcpy.

This will kill alignment memes for good, and besides, it didn't
seem to make much of a difference.
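
A minimal sketch of the memcpy-based read, assuming an illustrative helper name rather than xxHash's exact internals:

```c
#include <stdint.h>
#include <string.h>

/* Alignment-safe 64-bit read: memcpy lets the compiler emit a plain
 * unaligned load on targets that allow it, and byte loads on targets
 * (like ARMv6 with ldrd) that do not. Modern compilers optimize this
 * to a single load, so there is no real cost. */
static uint64_t read64(const void* ptr)
{
    uint64_t val;
    memcpy(&val, ptr, sizeof(val));
    return val;
}
```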

Additionally, I added my better 128-bit long multiply
and applied DRY to XXH3_mul128_fold64. This also removes
the cryptic inline assembly hack.
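
A hedged sketch of the portable fallback for such a 128-bit multiply, built from four 32x32->64 partial products. The names are illustrative; the real code dispatches to compiler builtins where available:

```c
#include <stdint.h>

typedef struct { uint64_t low64, high64; } U128;

/* Portable 64x64 -> 128-bit multiply from four partial products. */
static U128 mult64to128(uint64_t lhs, uint64_t rhs)
{
    uint64_t lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
    uint64_t lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
    uint64_t hi_hi = (lhs >> 32)        * (rhs >> 32);

    /* Sum the middle products, tracking the carry into the high half. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
    uint64_t upper = (hi_lo >> 32) + (cross >> 32) + hi_hi;
    uint64_t lower = (cross << 32) | (lo_lo & 0xFFFFFFFF);

    U128 r; r.low64 = lower; r.high64 = upper; return r;
}

/* The fold used by XXH3: XOR the two halves back together. */
static uint64_t mul128_fold64(uint64_t lhs, uint64_t rhs)
{
    U128 product = mult64to128(lhs, rhs);
    return product.low64 ^ product.high64;
}
```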

Each method was documented, too (we need more comments).

Also, I added a warning for users who are compiling Thumb-1
code for a target supporting ARM instructions.

While all versions of ARM and Thumb-2 meet XXH3's base requirements,
Thumb-1 does not.

First of all, UMULL is inaccessible in the 16-bit subset. This means
that every XXH_mult32to64 becomes a call to __aeabi_lmul.

Since every operation in XXH3 needs to happen in the Lo registers,
and r0-r3 must be set up many times to call __aeabi_lmul, the output
resembles a game of Rush Hour:

```
$ clang -O3 -S --target=arm-none-eabi -march=armv4t -mthumb xxhash.c
$ grep -c mov xxhash.s
5472
$ clang -O3 -S --target=arm-none-eabi -march=armv4t xxhash.c
$ grep -c mov xxhash.s
2071
```
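
For reference, XXH_mult32to64 is essentially a single widening multiply; a hedged sketch of that primitive:

```c
#include <stdint.h>

/* A 32x32 -> 64-bit widening multiply. On 32-bit ARM this compiles
 * to one UMULL; in Thumb-1, which lacks UMULL, the compiler instead
 * emits a call to the __aeabi_lmul runtime routine. */
static uint64_t mult32to64(uint32_t x, uint32_t y)
{
    return (uint64_t)x * (uint64_t)y;
}
```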

It is much more practical to compile xxHash with the wider instruction
sets, as these restrictions do not apply.

This doesn't warn if ARMv6-M is targeted, since there Thumb-1 is
unavoidable.

Lastly, I removed the pragma clang loop hack which didn't work anymore
since the number of iterations can't be constant evaluated. Now, we
don't have 20 warnings when compiling for x86.
2019-09-16 10:09:00 -04:00
Yann Collet
69c9558be5
Merge pull request #252 from nigeltao/dev
Add comment about CRC32 speed comparison
2019-09-14 23:12:13 -07:00
Nigel Tao
879d0af51a Add comment about CRC32 speed comparison 2019-09-15 10:40:17 +10:00
Yann Collet
77fd98f6b5
Merge pull request #250 from Cyan4973/visualWarnings
Visual Studio tests on Appveyor
2019-09-10 13:41:23 -07:00
Yann Collet
a87e5908c7 hopefully fixed the Visual test on Appveyor
by using a custom variable XXHASH_C_FLAGS
as suggested by @wesm.
2019-09-10 10:53:58 -07:00
Yann Collet
e18a23a582 Visual Studio tests on Appveyor
now generate errors when there is a compiler warning
fix #249

Also fix a few corresponding minor warnings on Visual.
2019-09-06 16:05:44 -07:00
Yann Collet
726c14000c
Merge pull request #247 from easyaspi314/armv6fix
Prevent Clang from emitting unaligned ldm/ldrd on ARMv6, better arm macros
2019-08-28 15:58:12 -07:00
easyaspi314 (Devin)
8bcf561e21 Silence -Wundef warning
IT IS DEFINED BY THE STANDARD
2019-08-28 17:33:52 -04:00
easyaspi314 (Devin)
662e199ceb Prevent Clang from emitting unaligned ldm/ldrd on ARMv6, better arm macros
Clang was using ldmia and ldrd on unaligned pointers. These
instructions don't support unaligned access.

I also check the numerical value of __ARM_ARCH.
2019-08-28 17:18:16 -04:00
Yann Collet
17969c422d
Merge pull request #246 from bram-ivs/fixXXH32types
fix XXH32 and XXH32_digest return types
2019-08-27 06:17:20 -07:00