We tweaked ChaCha to arrive at the IETF's implementation specified by RFC 7539. We are not sure how to handle block counter wrap. At the moment the caller is responsible for managing it. We were not able to find a reference implementation so we disable SIMD implementations like SSE, AVX, NEON and Power4. We need the wide block tests for corner cases to ensure our implementation is correct.
This is only a work-around for the moment. The issue only affects SIMD code. The problem is, the algorithm we use performs a 32-bit add as an intermediate result, but we really need a 64-bit add. We are running 4 transforms in parallel, and we can't add and carry the way we need to.
The workaround is, whenever we could cross the 32-bit counter boundary we use the C version of the transform. We determine the cross-over point by 'bool safe = 0xffffffff - state.low > 4'. When not safe we skip the SIMD version of the algorithm and use the C version. Once we are safe again we use the SIMD version again.
The work-around costs us about 0.1 to 0.2 cpb. At 1.10 or 1.15 cpb that equates to about 200 MB/s on a Skylake. We'd like to get it back eventually.
Thanks to Jack Lloyd and Botan for allowing us to use the implementation.
The numbers for SSE2 are very good. When compared with Salsa20 ASM the results are:
* Salsa20 2.55 cpb; ChaCha/20 2.90 cpb
* Salsa20/12 1.61 cpb; ChaCha/12 1.90 cpb
* Salsa20/8 1.34 cpb; ChaCha/8 1.5 cpb
These changes made it in by accident at Commit b74a6f4445. We were going to try to let them ride but they broke versioning. They may be added later but we should avoid the change at this time.