The previous XXH3_accumulate_512 loop didn't fare well once XXH128
started swapping the addition.
Neither GCC nor Clang could follow the barely-readable loop, resulting
in garbage code output.
This made XXH3 much slower. Take 32-bit scalar ARM as an example.
Ignoring loads and potential interleaving optimizations, in the main
loop, XXH32 takes 16 cycles for 8 bytes on a typical ARMv6+ CPU, or 2 cpb.
```asm
mla r0, r2, r5, r0 @ 4 cycles: acc1 += lane1 * PRIME32_2
ror r0, r0, #19    @ 1 cycle : acc1  = rotl32(acc1, 13)
mul r0, r0, r6     @ 3 cycles: acc1 *= PRIME32_1
mla r1, r3, r5, r1 @ 4 cycles: acc2 += lane2 * PRIME32_2
ror r1, r1, #19    @ 1 cycle : acc2  = rotl32(acc2, 13)
mul r1, r1, r6     @ 3 cycles: acc2 *= PRIME32_1
```
XXH3_64b takes 9 cycles, or about 1.1 cpb:
```asm
adds  r0, r0, r2     @ 2 cycles: acc_lo += input_lo (sets carry)
adc   r1, r1, r3     @ 1 cycle : acc_hi += input_hi + carry
eor   r4, r4, r2     @ 1 cycle : key_lo ^= input_lo
eor   r5, r5, r3     @ 1 cycle : key_hi ^= input_hi
umlal r0, r1, r4, r5 @ 4 cycles: acc += (u64)key_lo * key_hi
```
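For reference, one lane of the scalar accumulate step compiles down to
roughly that sequence. A simplified C sketch (names, the lane-swap
convention, and endianness handling are approximations, not the exact
xxhash.c code):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of one scalar accumulate lane: read input and secret, XOR them,
 * add the input into the (swapped) accumulator, then fold the low and high
 * halves of the XORed value back in with a 32x32->64 multiply. */
static void accumulate_lane_sketch(uint64_t *acc, const uint8_t *input,
                                   const uint8_t *secret, size_t lane)
{
    uint64_t data_val, key64;
    memcpy(&data_val, input  + 8 * lane, 8);   /* assumes little-endian */
    memcpy(&key64,    secret + 8 * lane, 8);
    uint64_t const data_key = data_val ^ key64;
    acc[lane ^ 1] += data_val;                 /* swapped addition needed by XXH128 */
    acc[lane]     += (uint64_t)(uint32_t)data_key * (data_key >> 32);
}
```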
Benchmarking on a Pixel 2 XL (with a binary built for ARMv4T):
previously, XXH32 got 1.8 GB/s, while XXH3_64b got 1.7 GB/s.
Now, XXH3_64b gets 2.3 GB/s! That lines up well with the cycle
estimates (the additional loads and stores add some overhead).
Unlike before, it is better to disable autovectorization completely, as
the compiler can no longer vectorize the loop well. (Especially with
Clang and NEON, where it extracts lanes to multiply instead of using the
obvious vmlal.u32!)
On that same device in aarch64 mode, the scalar XXH3 version compiled
with `clang-8 -O3 -DXXH_VECTOR=0 -fno-vectorize -fno-slp-vectorize`
went from 2.3 GB/s to 4.3 GB/s. For comparison, the NEON version
gets 6.0 GB/s.
However, almost all platforms with decent autovectorization have a
handwritten intrinsics version which is much faster.
For optimal performance, use -fno-tree-vectorize -fno-tree-slp-vectorize
(or simply disable SIMD instructions entirely).
From testing, ARM32 also prefers forced inlining, so I enabled it.
I also fixed some typos.
Previously, XXH3_64bits looked much faster than XXH3_128bits. In truth,
they perform similarly on long keys. The difference was that
XXH3_64b's benchmark was unseeded, giving it an unfair advantage
over XXH128, which is seeded.
I don't think I am going to do the dummy bench; it made things more
complicated.
Sorry about the disorganized commit. :(
Yet again, I had to fix ARMv6. Clang switched from ldm to ldrd, which
also causes bus errors on unaligned addresses.
Therefore, I decided to fix the root problem and remove the
XXH_FORCE_DIRECT_MEMORY_ACCESS hack, using only memcpy.
This will kill alignment memes for good, and besides, it didn't
seem to make much of a difference.
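With the hack gone, every unaligned read goes through a memcpy-based
helper along these lines (a sketch of the idea; compilers lower it to a
single load wherever the target allows unaligned access):

```c
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from a possibly unaligned pointer.
 * memcpy is the portable, strict-aliasing-safe way to do this; GCC and
 * Clang compile it to a plain load when the target supports it. */
static uint64_t read64_sketch(const void *ptr)
{
    uint64_t val;
    memcpy(&val, ptr, sizeof(val));
    return val;   /* a byte swap would still be needed on big-endian targets */
}
```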
Additionally, I added my better 128-bit long multiply
and applied DRY to XXH3_mul128_fold64. This also removes
the cryptic inline assembly hack.
Each method was documented, too (we need more comments).
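The fold itself boils down to something like this (a sketch assuming
unsigned __int128 is available; the real function also needs 32-bit and
MSVC-intrinsic paths):

```c
#include <stdint.h>

/* Multiply two 64-bit values into a 128-bit product, then XOR-fold the
 * two halves back into 64 bits. Sketch only: assumes the compiler
 * provides unsigned __int128. */
static uint64_t mul128_fold64_sketch(uint64_t lhs, uint64_t rhs)
{
    unsigned __int128 const product = (unsigned __int128)lhs * rhs;
    return (uint64_t)product ^ (uint64_t)(product >> 64);
}
```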
Also, I added a warning for users who are compiling Thumb-1
code for a target supporting ARM instructions.
While all versions of ARM and Thumb-2 meet XXH3's base requirements,
Thumb-1 does not.
First of all, UMULL is inaccessible in the 16-bit subset. This means
that every XXH_mult32to64 becomes a call to __aeabi_lmul.
Since every operation in XXH3 needs to happen in the Lo registers, and
r0-r3 must be set up over and over for the __aeabi_lmul calls, the
output resembles a game of Rush Hour:
```
$ clang -O3 -S --target=arm-none-eabi -march=armv4t -mthumb xxhash.c
$ grep -c mov xxhash.s
5472
$ clang -O3 -S --target=arm-none-eabi -march=armv4t xxhash.c
$ grep -c mov xxhash.s
2071
```
It is much more practical to compile xxHash for the wider instruction
sets, where these restrictions do not apply.
The warning is not emitted when targeting ARMv6-M, where Thumb-1 is
unavoidable.
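The check behind the warning is roughly this shape (a sketch using
standard ACLE/GCC macros; the exact condition in xxhash.h may differ):

```c
/* Warn when compiling Thumb-1 (Thumb without Thumb-2) for a target that
 * also has the 32-bit ARM instruction set available. ARMv6-M targets do
 * not define __ARM_ARCH_ISA_ARM, so they stay silent. */
#if defined(__thumb__) && !defined(__thumb2__) && defined(__ARM_ARCH_ISA_ARM)
#  warning "XXH3 is inefficient in Thumb-1; compile in ARM or Thumb-2 mode."
#endif
```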
Lastly, I removed the `#pragma clang loop` hack, which no longer worked
since the number of iterations can't be constant-evaluated. Now we
don't get 20 warnings when compiling for x86.
Clang was using ldmia and ldrd on unaligned pointers. These
instructions don't support unaligned access.
I also check the numerical value of __ARM_ARCH.
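For illustration, checking the numeric macro looks roughly like this (a
sketch; the exact version threshold here is an assumption):

```c
/* Prefer the single numeric __ARM_ARCH macro (ARMv6 -> 6, ARMv7 -> 7, ...)
 * over enumerating __ARM_ARCH_6__, __ARM_ARCH_7A__, etc. individually. */
#if defined(__ARM_ARCH) && __ARM_ARCH >= 6
    /* ARMv6+ specific code path */
#endif
```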