This picks up about 0.2 cpb in ChaCha::OperateKeystream. It may not sound like much but it puts SSE2 intrinsics version on par with the ASM version of Salsa20. Salsa20 leads ChaCha by 0.1 to 0.15 cpb, which equates to about 50 MB/s.
Thanks to Jack Lloyd and Botan for allowing us to use the implementation.
The numbers for SSE2 are very good. When compared with Salsa20 ASM the results are:
* Salsa20 2.55 cpb; ChaCha/20 2.90 cpb
* Salsa20/12 1.61 cpb; ChaCha/12 1.90 cpb
* Salsa20/8 1.34 cpb; ChaCha/8 1.5 cpb