// adv-simd.h - written and placed in the public domain by Jeffrey Walton

/// \file adv-simd.h
/// \brief Template for AdvancedProcessBlocks and SIMD processing

// The SIMD based implementations for ciphers that use SSE, NEON and Power7
// have a common pattern. Namely, they have a specialized implementation of
// AdvancedProcessBlocks which processes multiple blocks using hardware
// acceleration. After several implementations we noticed a lot of copy and
// paste occurring. adv-simd.h provides a template to avoid the copy and paste.
//
// There are 10 templates provided in this file. The number following the
// function name is the block size of the cipher. The name following that
// is the acceleration and arrangement. For example 4x1_SSE means Intel SSE
// using two encrypt (or decrypt) functions: one that operates on 4 blocks,
// and one that operates on 1 block.
//
//   * AdvancedProcessBlocks64_2x1_SSE
//   * AdvancedProcessBlocks64_4x1_SSE
//   * AdvancedProcessBlocks128_4x1_SSE
//   * AdvancedProcessBlocks64_6x2_SSE
//   * AdvancedProcessBlocks128_6x2_SSE
//   * AdvancedProcessBlocks64_6x2_NEON
//   * AdvancedProcessBlocks128_4x1_NEON
//   * AdvancedProcessBlocks128_6x2_NEON
//   * AdvancedProcessBlocks64_6x1_ALTIVEC
//   * AdvancedProcessBlocks128_6x1_ALTIVEC
//
// If an arrangement ends in 2, like 6x2, then the template will handle the
// single block case by padding with 0's and using the two block function.
// This happens at most one time when processing multiple blocks. The extra
// processing of a zero block is trivial and worth the tradeoff.
//
// The MAYBE_CONST macro present on x86 is a SunCC workaround. Some versions
// of SunCC lose/drop the const-ness in the F1 and F4 functions. It eventually
// results in a failed link due to the const/non-const mismatch.
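//
// For illustration only, a cipher's SIMD source file typically supplies the
// block worker functions and forwards to one of these templates. The cipher
// and worker names below are hypothetical placeholders, not part of this header:
//
//   size_t MyCipher_Enc_AdvancedProcessBlocks_SSE(const word32 *subKeys, size_t rounds,
//       const byte *inBlocks, const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
//   {
//       return AdvancedProcessBlocks64_6x2_SSE(MyCipher_Enc_2_Blocks, MyCipher_Enc_6_Blocks,
//           subKeys, rounds, inBlocks, xorBlocks, outBlocks, length, flags);
//   }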

#ifndef CRYPTOPP_ADVANCED_SIMD_TEMPLATES
#define CRYPTOPP_ADVANCED_SIMD_TEMPLATES

#include "config.h"
#include "misc.h"
#include "stdcpp.h"

#if (CRYPTOPP_ARM_NEON_AVAILABLE)
# include <arm_neon.h>
#endif

#if defined(CRYPTOPP_ARM_ACLE_AVAILABLE)
# include <stdint.h>
# include <arm_acle.h>
#endif

#if (CRYPTOPP_SSE2_INTRIN_AVAILABLE)
# include <emmintrin.h>
# include <xmmintrin.h>
#endif

// SunCC needs CRYPTOPP_SSSE3_AVAILABLE, too
#if (CRYPTOPP_SSSE3_AVAILABLE)
# include <emmintrin.h>
# include <pmmintrin.h>
# include <xmmintrin.h>
#endif

#if defined(CRYPTOPP_ALTIVEC_AVAILABLE)
# include "ppc-simd.h"
#endif

// ************************ All block ciphers *********************** //

ANONYMOUS_NAMESPACE_BEGIN

using CryptoPP::BlockTransformation;

CRYPTOPP_CONSTANT(BT_XorInput = BlockTransformation::BT_XorInput)
CRYPTOPP_CONSTANT(BT_AllowParallel = BlockTransformation::BT_AllowParallel)
CRYPTOPP_CONSTANT(BT_InBlockIsCounter = BlockTransformation::BT_InBlockIsCounter)
CRYPTOPP_CONSTANT(BT_ReverseDirection = BlockTransformation::BT_ReverseDirection)
CRYPTOPP_CONSTANT(BT_DontIncrementInOutPointers = BlockTransformation::BT_DontIncrementInOutPointers)

ANONYMOUS_NAMESPACE_END

// *************************** ARM NEON ************************** //

#if (CRYPTOPP_ARM_NEON_AVAILABLE)

NAMESPACE_BEGIN(CryptoPP)

/// \brief AdvancedProcessBlocks for 2 and 6 blocks
/// \tparam F2 function to process 2 64-bit blocks
/// \tparam F6 function to process 6 64-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks64_6x2_NEON processes 6 and 2 NEON SIMD words
///   at a time. For a single block the template uses F2 with a zero block.
/// \details The subkey type is usually word32 or word64. F2 and F6 must use the
///   same word type.
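/// \details A sketch of the worker shapes this template expects (parameter
///   names are illustrative only, not part of the library API):
/// <pre>
///   void F2(uint32x4_t &block0, uint32x4_t &block1,
///           const W *subKeys, unsigned int rounds);
///   void F6(uint32x4_t &block0, uint32x4_t &block1, uint32x4_t &block2,
///           uint32x4_t &block3, uint32x4_t &block4, uint32x4_t &block5,
///           const W *subKeys, unsigned int rounds);
/// </pre>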
template <typename F2, typename F6, typename W>
inline size_t AdvancedProcessBlocks64_6x2_NEON(F2 func2, F6 func6,
    const W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 8);

#if defined(CRYPTOPP_LITTLE_ENDIAN)
    const word32 s_zero32x4[] = {0, 0, 0, 0};
    const word32 s_one32x4_1b[] = {0, 0, 0, 1<<24};
    const word32 s_one32x4_2b[] = {0, 2<<24, 0, 2<<24};
#else
    const word32 s_zero32x4[] = {0, 0, 0, 0};
    const word32 s_one32x4_1b[] = {0, 0, 0, 1};
    const word32 s_one32x4_2b[] = {0, 2, 0, 2};
#endif

    const size_t blockSize = 8;
    const size_t neonBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : neonBlockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? neonBlockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : neonBlockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - neonBlockSize);
        xorBlocks = PtrAdd(xorBlocks, length - neonBlockSize);
        outBlocks = PtrAdd(outBlocks, length - neonBlockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 6*neonBlockSize)
        {
            uint32x4_t block0, block1, block2, block3, block4, block5;
            if (flags & BT_InBlockIsCounter)
            {
                // For 64-bit block ciphers we need to load the CTR block, which is 8 bytes.
                // After the dup load we have two counters in the NEON word. Then we need
                // to increment the low ctr by 0 and the high ctr by 1.
                const uint8x8_t ctr = vld1_u8(inBlocks);
                block0 = vaddq_u32(vld1q_u32(s_one32x4_1b),
                    vreinterpretq_u32_u8(vcombine_u8(ctr,ctr)));

                // After initial increment of {0,1} remaining counters increment by {2,2}.
                const uint32x4_t be2 = vld1q_u32(s_one32x4_2b);
                block1 = vaddq_u32(be2, block0);
                block2 = vaddq_u32(be2, block1);
                block3 = vaddq_u32(be2, block2);
                block4 = vaddq_u32(be2, block3);
                block5 = vaddq_u32(be2, block4);

                vst1_u8(const_cast<byte*>(inBlocks), vget_low_u8(
                    vreinterpretq_u8_u32(vaddq_u32(be2, block5))));
            }
            else
            {
                block0 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block4 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block5 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = veorq_u32(block0, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u32(block1, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u32(block2, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u32(block3, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = veorq_u32(block4, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = veorq_u32(block5, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func6(block0, block1, block2, block3, block4, block5, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = veorq_u32(block0, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u32(block1, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u32(block2, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u32(block3, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = veorq_u32(block4, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = veorq_u32(block5, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            vst1q_u8(outBlocks, vreinterpretq_u8_u32(block0));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u32(block1));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u32(block2));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u32(block3));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u32(block4));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u32(block5));
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 6*neonBlockSize;
        }

        while (length >= 2*neonBlockSize)
        {
            uint32x4_t block0, block1;
            if (flags & BT_InBlockIsCounter)
            {
                // For 64-bit block ciphers we need to load the CTR block, which is 8 bytes.
                // After the dup load we have two counters in the NEON word. Then we need
                // to increment the low ctr by 0 and the high ctr by 1.
                const uint8x8_t ctr = vld1_u8(inBlocks);
                block0 = vaddq_u32(vld1q_u32(s_one32x4_1b),
                    vreinterpretq_u32_u8(vcombine_u8(ctr,ctr)));

                // After initial increment of {0,1} remaining counters increment by {2,2}.
                const uint32x4_t be2 = vld1q_u32(s_one32x4_2b);
                block1 = vaddq_u32(be2, block0);

                vst1_u8(const_cast<byte*>(inBlocks), vget_low_u8(
                    vreinterpretq_u8_u32(vaddq_u32(be2, block1))));
            }
            else
            {
                block0 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = veorq_u32(block0, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u32(block1, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func2(block0, block1, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = veorq_u32(block0, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u32(block1, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            vst1q_u8(outBlocks, vreinterpretq_u8_u32(block0));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u32(block1));
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 2*neonBlockSize;
        }
    }

    if (length)
    {
        // Adjust to real block size
        if (flags & BT_ReverseDirection)
        {
            inIncrement += inIncrement ? blockSize : 0;
            xorIncrement += xorIncrement ? blockSize : 0;
            outIncrement += outIncrement ? blockSize : 0;
            inBlocks -= inIncrement;
            xorBlocks -= xorIncrement;
            outBlocks -= outIncrement;
        }
        else
        {
            inIncrement -= inIncrement ? blockSize : 0;
            xorIncrement -= xorIncrement ? blockSize : 0;
            outIncrement -= outIncrement ? blockSize : 0;
        }

        while (length >= blockSize)
        {
            uint32x4_t block, zero = vld1q_u32(s_zero32x4);

            const uint8x8_t v = vld1_u8(inBlocks);
            block = vreinterpretq_u32_u8(vcombine_u8(v,v));

            if (xorInput)
            {
                const uint8x8_t x = vld1_u8(xorBlocks);
                block = veorq_u32(block, vreinterpretq_u32_u8(vcombine_u8(x,x)));
            }

            if (flags & BT_InBlockIsCounter)
                const_cast<byte *>(inBlocks)[7]++;

            func2(block, zero, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                const uint8x8_t x = vld1_u8(xorBlocks);
                block = veorq_u32(block, vreinterpretq_u32_u8(vcombine_u8(x,x)));
            }

            vst1_u8(const_cast<byte*>(outBlocks),
                vget_low_u8(vreinterpretq_u8_u32(block)));

            inBlocks = PtrAdd(inBlocks, inIncrement);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            length -= blockSize;
        }
    }

    return length;
}

/// \brief AdvancedProcessBlocks for 1 and 6 blocks
/// \tparam F1 function to process 1 128-bit block
/// \tparam F6 function to process 6 128-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks128_6x1_NEON processes 6 and 1 NEON SIMD words
///   at a time.
/// \details The subkey type is usually word32 or word64. F1 and F6 must use the
///   same word type.
template <typename F1, typename F6, typename W>
inline size_t AdvancedProcessBlocks128_6x1_NEON(F1 func1, F6 func6,
    const W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);

#if defined(CRYPTOPP_LITTLE_ENDIAN)
    const word32 s_zero32x4[] = {0, 0, 0, 0};
    const word32 s_one32x4[] = {0, 0, 0, 1<<24};
#else
    const word32 s_zero32x4[] = {0, 0, 0, 0};
    const word32 s_one32x4[] = {0, 0, 0, 1};
#endif

    const size_t blockSize = 16;
    // const size_t neonBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 6*blockSize)
        {
            uint64x2_t block0, block1, block2, block3, block4, block5;
            if (flags & BT_InBlockIsCounter)
            {
                const uint64x2_t be = vreinterpretq_u64_u32(vld1q_u32(s_one32x4));
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));

                block1 = vaddq_u64(block0, be);
                block2 = vaddq_u64(block1, be);
                block3 = vaddq_u64(block2, be);
                block4 = vaddq_u64(block3, be);
                block5 = vaddq_u64(block4, be);
                vst1q_u8(const_cast<byte*>(inBlocks),
                    vreinterpretq_u8_u64(vaddq_u64(block5, be)));
            }
            else
            {
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block4 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block5 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u64(block2, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u64(block3, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = veorq_u64(block4, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = veorq_u64(block5, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func6(block0, block1, block2, block3, block4, block5, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u64(block2, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u64(block3, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = veorq_u64(block4, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = veorq_u64(block5, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block0));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block1));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block2));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block3));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block4));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block5));
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 6*blockSize;
        }
    }

    while (length >= blockSize)
    {
        uint64x2_t block;
        block = vreinterpretq_u64_u8(vld1q_u8(inBlocks));

        if (xorInput)
            block = veorq_u64(block, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func1(block, subKeys, static_cast<unsigned int>(rounds));

        if (xorOutput)
            block = veorq_u64(block, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));

        vst1q_u8(outBlocks, vreinterpretq_u8_u64(block));

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}

/// \brief AdvancedProcessBlocks for 1 and 4 blocks
/// \tparam F1 function to process 1 128-bit block
/// \tparam F4 function to process 4 128-bit blocks
/// \tparam W word type of the subkey table
/// \tparam V vector type of the NEON datatype
/// \details AdvancedProcessBlocks128_4x1_NEON processes 4 and 1 NEON SIMD words
///   at a time.
/// \details The subkey type is usually word32 or word64. V is the vector type and it is
///   usually uint32x4_t or uint64x2_t. F1, F4, W and V must use the same word and
///   vector type. The V parameter is used to avoid template argument
///   deduction/substitution failures.
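/// \details For illustration only (the cipher worker names are hypothetical),
///   an instantiation might look like:
/// <pre>
///   const uint64x2_t unused = vdupq_n_u64(0);  // only the type matters
///   AdvancedProcessBlocks128_4x1_NEON(MyCipher_Enc_Block, MyCipher_Enc_4_Blocks,
///       unused, subKeys, rounds, inBlocks, xorBlocks, outBlocks, length, flags);
/// </pre>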
template <typename F1, typename F4, typename W, typename V>
inline size_t AdvancedProcessBlocks128_4x1_NEON(F1 func1, F4 func4,
    const V& unused, const W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);
    CRYPTOPP_UNUSED(unused);

#if defined(CRYPTOPP_LITTLE_ENDIAN)
    const word32 s_one32x4[] = {0, 0, 0, 1<<24};
#else
    const word32 s_one32x4[] = {0, 0, 0, 1};
#endif

    const size_t blockSize = 16;
    // const size_t neonBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 4*blockSize)
        {
            uint64x2_t block0, block1, block2, block3;
            if (flags & BT_InBlockIsCounter)
            {
                const uint64x2_t be = vreinterpretq_u64_u32(vld1q_u32(s_one32x4));
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));

                block1 = vaddq_u64(block0, be);
                block2 = vaddq_u64(block1, be);
                block3 = vaddq_u64(block2, be);
                vst1q_u8(const_cast<byte*>(inBlocks),
                    vreinterpretq_u8_u64(vaddq_u64(block3, be)));
            }
            else
            {
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u64(block2, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u64(block3, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func4((V&)block0, (V&)block1, (V&)block2, (V&)block3, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u64(block2, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u64(block3, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block0));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block1));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block2));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block3));
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 4*blockSize;
        }
    }

    while (length >= blockSize)
    {
        uint64x2_t block = vreinterpretq_u64_u8(vld1q_u8(inBlocks));

        if (xorInput)
            block = veorq_u64(block, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func1((V&)block, subKeys, static_cast<unsigned int>(rounds));

        if (xorOutput)
            block = veorq_u64(block, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));

        vst1q_u8(outBlocks, vreinterpretq_u8_u64(block));

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}

/// \brief AdvancedProcessBlocks for 2 and 6 blocks
/// \tparam F2 function to process 2 128-bit blocks
/// \tparam F6 function to process 6 128-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks128_6x2_NEON processes 6 and 2 NEON SIMD words
///   at a time. For a single block the template uses F2 with a zero block.
/// \details The subkey type is usually word32 or word64. F2 and F6 must use the
///   same word type.
template <typename F2, typename F6, typename W>
inline size_t AdvancedProcessBlocks128_6x2_NEON(F2 func2, F6 func6,
    const W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);

#if defined(CRYPTOPP_LITTLE_ENDIAN)
    const word32 s_one32x4[] = {0, 0, 0, 1<<24};
#else
    const word32 s_one32x4[] = {0, 0, 0, 1};
#endif

    const size_t blockSize = 16;
    // const size_t neonBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 6*blockSize)
        {
            uint64x2_t block0, block1, block2, block3, block4, block5;
            if (flags & BT_InBlockIsCounter)
            {
                const uint64x2_t be = vreinterpretq_u64_u32(vld1q_u32(s_one32x4));
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));

                block1 = vaddq_u64(block0, be);
                block2 = vaddq_u64(block1, be);
                block3 = vaddq_u64(block2, be);
                block4 = vaddq_u64(block3, be);
                block5 = vaddq_u64(block4, be);
                vst1q_u8(const_cast<byte*>(inBlocks),
                    vreinterpretq_u8_u64(vaddq_u64(block5, be)));
            }
            else
            {
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block4 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block5 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u64(block2, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u64(block3, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = veorq_u64(block4, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = veorq_u64(block5, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func6(block0, block1, block2, block3, block4, block5, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u64(block2, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u64(block3, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = veorq_u64(block4, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = veorq_u64(block5, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block0));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block1));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block2));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block3));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block4));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block5));
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 6*blockSize;
        }

        while (length >= 2*blockSize)
        {
            uint64x2_t block0, block1;
            if (flags & BT_InBlockIsCounter)
            {
                const uint64x2_t be = vreinterpretq_u64_u32(vld1q_u32(s_one32x4));
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                block1 = vaddq_u64(block0, be);

                vst1q_u8(const_cast<byte*>(inBlocks),
                    vreinterpretq_u8_u64(vaddq_u64(block1, be)));
            }
            else
            {
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func2(block0, block1, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block0));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block1));
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 2*blockSize;
        }
    }

    while (length >= blockSize)
    {
        uint64x2_t block, zero = {0,0};
        block = vreinterpretq_u64_u8(vld1q_u8(inBlocks));

        if (xorInput)
            block = veorq_u64(block, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func2(block, zero, subKeys, static_cast<unsigned int>(rounds));

        if (xorOutput)
            block = veorq_u64(block, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));

        vst1q_u8(outBlocks, vreinterpretq_u8_u64(block));

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}

NAMESPACE_END  // CryptoPP

#endif  // CRYPTOPP_ARM_NEON_AVAILABLE

// *************************** Intel SSE ************************** //

#if defined(CRYPTOPP_SSSE3_AVAILABLE)

// Hack for SunCC, http://github.com/weidai11/cryptopp/issues/224
#if (__SUNPRO_CC >= 0x5130)
# define MAYBE_CONST
# define MAYBE_UNCONST_CAST(T, x) const_cast<MAYBE_CONST T>(x)
#else
# define MAYBE_CONST const
# define MAYBE_UNCONST_CAST(T, x) (x)
#endif

// Clang __m128i casts, http://bugs.llvm.org/show_bug.cgi?id=20670
#ifndef M128_CAST
# define M128_CAST(x) ((__m128i *)(void *)(x))
#endif
#ifndef CONST_M128_CAST
# define CONST_M128_CAST(x) ((const __m128i *)(const void *)(x))
#endif

NAMESPACE_BEGIN(CryptoPP)

/// \brief AdvancedProcessBlocks for 1 and 2 blocks
/// \tparam F1 function to process 1 64-bit block
/// \tparam F2 function to process 2 64-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks64_2x1_SSE processes 2 and 1 SSE SIMD words
///   at a time.
/// \details The subkey type is usually word32 or word64. F1 and F2 must use the
///   same word type.
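/// \details A sketch of the worker shapes this template expects (parameter
///   names are illustrative only, not part of the library API):
/// <pre>
///   void F1(__m128i &block0, const W *subKeys, unsigned int rounds);
///   void F2(__m128i &block0, __m128i &block1, const W *subKeys, unsigned int rounds);
/// </pre>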
|
2018-06-21 04:37:10 +00:00
|
|
|
template <typename F1, typename F2, typename W>
|
2018-07-01 05:23:35 +00:00
|
|
|
inline size_t AdvancedProcessBlocks64_2x1_SSE(F1 func1, F2 func2,
|
2018-06-22 20:26:27 +00:00
|
|
|
MAYBE_CONST W *subKeys, size_t rounds, const byte *inBlocks,
|
2018-06-21 04:37:10 +00:00
|
|
|
const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
|
|
|
|
{
|
|
|
|
CRYPTOPP_ASSERT(subKeys);
|
|
|
|
CRYPTOPP_ASSERT(inBlocks);
|
|
|
|
CRYPTOPP_ASSERT(outBlocks);
|
|
|
|
CRYPTOPP_ASSERT(length >= 8);
|
|
|
|
|
2018-07-10 09:00:02 +00:00
|
|
|
const size_t blockSize = 8;
|
|
|
|
const size_t xmmBlockSize = 16;
|
2018-06-21 04:37:10 +00:00
|
|
|
|
2018-07-10 09:00:02 +00:00
|
|
|
size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : xmmBlockSize;
|
|
|
|
size_t xorIncrement = (xorBlocks != NULLPTR) ? xmmBlockSize : 0;
|
|
|
|
size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : xmmBlockSize;
|
2018-06-21 04:37:10 +00:00
|
|
|
|
|
|
|
// Clang and Coverity are generating findings using xorBlocks as a flag.
|
|
|
|
const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
|
|
|
|
const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);
|
|
|
|
|
|
|
|
if (flags & BT_ReverseDirection)
|
|
|
|
{
|
2018-07-10 09:00:02 +00:00
|
|
|
inBlocks = PtrAdd(inBlocks, length - xmmBlockSize);
|
|
|
|
xorBlocks = PtrAdd(xorBlocks, length - xmmBlockSize);
|
|
|
|
outBlocks = PtrAdd(outBlocks, length - xmmBlockSize);
|
2018-06-21 04:37:10 +00:00
|
|
|
inIncrement = 0-inIncrement;
|
|
|
|
xorIncrement = 0-xorIncrement;
|
|
|
|
outIncrement = 0-outIncrement;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (flags & BT_AllowParallel)
|
|
|
|
{
|
2018-08-12 23:04:14 +00:00
|
|
|
double temp[2];
|
2018-06-21 04:37:10 +00:00
|
|
|
while (length >= 2*xmmBlockSize)
|
|
|
|
{
|
|
|
|
__m128i block0, block1;
|
|
|
|
if (flags & BT_InBlockIsCounter)
|
|
|
|
{
|
2018-08-12 23:04:14 +00:00
|
|
|
// Increment of 1 and 2 in big-endian compatible with the ctr byte array.
|
|
|
|
const __m128i s_one = _mm_set_epi32(1<<24, 0, 0, 0);
|
|
|
|
const __m128i s_two = _mm_set_epi32(2<<24, 0, 2<<24, 0);
|
|
|
|
|
2018-06-21 04:37:10 +00:00
|
|
|
// For 64-bit block ciphers we need to load the CTR block, which is 8 bytes.
|
|
|
|
// After the dup load we have two counters in the XMM word. Then we need
|
|
|
|
// to increment the low ctr by 0 and the high ctr by 1.
|
2018-07-01 05:23:35 +00:00
|
|
|
std::memcpy(temp, inBlocks, blockSize);
|
2018-08-12 23:04:14 +00:00
|
|
|
block0 = _mm_add_epi32(s_one, _mm_castpd_si128(_mm_loaddup_pd(temp)));
|
2018-06-21 04:37:10 +00:00
|
|
|
|
|
|
|
// After initial increment of {0,1} remaining counters increment by {2,2}.
|
2018-08-12 23:04:14 +00:00
|
|
|
block1 = _mm_add_epi32(s_two, block0);
|
2018-06-21 04:37:10 +00:00
|
|
|
|
2018-07-01 08:03:30 +00:00
|
|
|
                // Store the next counter. When BT_InBlockIsCounter is set then
                // inBlocks is backed by m_counterArray which is non-const.
                _mm_store_sd(temp, _mm_castsi128_pd(_mm_add_epi64(s_two, block1)));
                std::memcpy(const_cast<byte*>(inBlocks), temp, blockSize);
            }
            else
            {
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func2(block0, block1, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            _mm_storeu_si128(M128_CAST(outBlocks), block0);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block1);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 2*xmmBlockSize;
        }
    }

    if (length)
    {
        // Adjust to real block size
        if (flags & BT_ReverseDirection)
        {
            inIncrement += inIncrement ? blockSize : 0;
            xorIncrement += xorIncrement ? blockSize : 0;
            outIncrement += outIncrement ? blockSize : 0;
            inBlocks -= inIncrement;
            xorBlocks -= xorIncrement;
            outBlocks -= outIncrement;
        }
        else
        {
            inIncrement -= inIncrement ? blockSize : 0;
            xorIncrement -= xorIncrement ? blockSize : 0;
            outIncrement -= outIncrement ? blockSize : 0;
        }

        while (length >= blockSize)
        {
            double temp[2];
            std::memcpy(temp, inBlocks, blockSize);
            __m128i block = _mm_castpd_si128(_mm_load_sd(temp));

            if (xorInput)
            {
                std::memcpy(temp, xorBlocks, blockSize);
                block = _mm_xor_si128(block, _mm_castpd_si128(_mm_load_sd(temp)));
            }

            if (flags & BT_InBlockIsCounter)
                const_cast<byte *>(inBlocks)[7]++;

            func1(block, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                std::memcpy(temp, xorBlocks, blockSize);
                block = _mm_xor_si128(block, _mm_castpd_si128(_mm_load_sd(temp)));
            }

            _mm_store_sd(temp, _mm_castsi128_pd(block));
            std::memcpy(outBlocks, temp, blockSize);

            inBlocks = PtrAdd(inBlocks, inIncrement);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            length -= blockSize;
        }
    }

    return length;
}
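
// A short worked example of the counter math used by the 64-bit SSE templates
// in this file (an explanatory note; the byte offsets assume the usual
// little-endian XMM lane layout). After _mm_loaddup_pd the 8-byte big-endian
// counter occupies both 64-bit lanes of the register, so
//
//   _mm_set_epi32(1<<24, 0, 0, 0)       // adds 1 to byte 15: high-lane counter only
//   _mm_set_epi32(2<<24, 0, 2<<24, 0)   // adds 2 to bytes 7 and 15: both counters
//
// produce the pair {n, n+1} on the first load and advance both counters by 2
// for every pair after that, which is consistent with the per-byte increment
// const_cast<byte *>(inBlocks)[7]++ used for the single-block tail.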

/// \brief AdvancedProcessBlocks for 2 and 6 blocks
/// \tparam F2 function to process 2 64-bit blocks
/// \tparam F6 function to process 6 64-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks64_6x2_SSE processes 6 and 2 SSE SIMD words
///   at a time. For a single block the template uses F2 with a zero block.
/// \details The subkey type is usually word32 or word64. F2 and F6 must use the
///   same word type.
template <typename F2, typename F6, typename W>
inline size_t AdvancedProcessBlocks64_6x2_SSE(F2 func2, F6 func6,
    MAYBE_CONST W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 8);

    const size_t blockSize = 8;
    const size_t xmmBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : xmmBlockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? xmmBlockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : xmmBlockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - xmmBlockSize);
        xorBlocks = PtrAdd(xorBlocks, length - xmmBlockSize);
        outBlocks = PtrAdd(outBlocks, length - xmmBlockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        double temp[2];
        while (length >= 6*xmmBlockSize)
        {
            __m128i block0, block1, block2, block3, block4, block5;
            if (flags & BT_InBlockIsCounter)
            {
                // Increment of 1 and 2 in big-endian compatible with the ctr byte array.
                const __m128i s_one = _mm_set_epi32(1<<24, 0, 0, 0);
                const __m128i s_two = _mm_set_epi32(2<<24, 0, 2<<24, 0);

                // For 64-bit block ciphers we need to load the CTR block, which is 8 bytes.
                // After the dup load we have two counters in the XMM word. Then we need
                // to increment the low ctr by 0 and the high ctr by 1.
                std::memcpy(temp, inBlocks, blockSize);
                block0 = _mm_add_epi32(s_one, _mm_castpd_si128(_mm_loaddup_pd(temp)));

                // After initial increment of {0,1} remaining counters increment by {2,2}.
                block1 = _mm_add_epi32(s_two, block0);
                block2 = _mm_add_epi32(s_two, block1);
                block3 = _mm_add_epi32(s_two, block2);
                block4 = _mm_add_epi32(s_two, block3);
                block5 = _mm_add_epi32(s_two, block4);

                // Store the next counter. When BT_InBlockIsCounter is set then
                // inBlocks is backed by m_counterArray which is non-const.
                _mm_store_sd(temp, _mm_castsi128_pd(_mm_add_epi32(s_two, block5)));
                std::memcpy(const_cast<byte*>(inBlocks), temp, blockSize);
            }
            else
            {
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block4 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block5 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = _mm_xor_si128(block2, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = _mm_xor_si128(block3, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = _mm_xor_si128(block4, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = _mm_xor_si128(block5, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func6(block0, block1, block2, block3, block4, block5, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = _mm_xor_si128(block2, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = _mm_xor_si128(block3, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = _mm_xor_si128(block4, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = _mm_xor_si128(block5, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            _mm_storeu_si128(M128_CAST(outBlocks), block0);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block1);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block2);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block3);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block4);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block5);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 6*xmmBlockSize;
        }

        while (length >= 2*xmmBlockSize)
        {
            __m128i block0, block1;
            if (flags & BT_InBlockIsCounter)
            {
                // Increment of 1 and 2 in big-endian compatible with the ctr byte array.
                const __m128i s_one = _mm_set_epi32(1<<24, 0, 0, 0);
                const __m128i s_two = _mm_set_epi32(2<<24, 0, 2<<24, 0);

                // For 64-bit block ciphers we need to load the CTR block, which is 8 bytes.
                // After the dup load we have two counters in the XMM word. Then we need
                // to increment the low ctr by 0 and the high ctr by 1.
                std::memcpy(temp, inBlocks, blockSize);
                block0 = _mm_add_epi32(s_one, _mm_castpd_si128(_mm_loaddup_pd(temp)));

                // After initial increment of {0,1} remaining counters increment by {2,2}.
                block1 = _mm_add_epi32(s_two, block0);

                // Store the next counter. When BT_InBlockIsCounter is set then
                // inBlocks is backed by m_counterArray which is non-const.
                _mm_store_sd(temp, _mm_castsi128_pd(_mm_add_epi64(s_two, block1)));
                std::memcpy(const_cast<byte*>(inBlocks), temp, blockSize);
            }
            else
            {
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func2(block0, block1, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            _mm_storeu_si128(M128_CAST(outBlocks), block0);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block1);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 2*xmmBlockSize;
        }
    }

    if (length)
    {
        // Adjust to real block size
        if (flags & BT_ReverseDirection)
        {
            inIncrement += inIncrement ? blockSize : 0;
            xorIncrement += xorIncrement ? blockSize : 0;
            outIncrement += outIncrement ? blockSize : 0;
            inBlocks -= inIncrement;
            xorBlocks -= xorIncrement;
            outBlocks -= outIncrement;
        }
        else
        {
            inIncrement -= inIncrement ? blockSize : 0;
            xorIncrement -= xorIncrement ? blockSize : 0;
            outIncrement -= outIncrement ? blockSize : 0;
        }

        while (length >= blockSize)
        {
            double temp[2];
            __m128i block, zero = _mm_setzero_si128();
            std::memcpy(temp, inBlocks, blockSize);
            block = _mm_castpd_si128(_mm_load_sd(temp));

            if (xorInput)
            {
                std::memcpy(temp, xorBlocks, blockSize);
                block = _mm_xor_si128(block,
                    _mm_castpd_si128(_mm_load_sd(temp)));
            }

            if (flags & BT_InBlockIsCounter)
                const_cast<byte *>(inBlocks)[7]++;

            func2(block, zero, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                std::memcpy(temp, xorBlocks, blockSize);
                block = _mm_xor_si128(block,
                    _mm_castpd_si128(_mm_load_sd(temp)));
            }

            _mm_store_sd(temp, _mm_castsi128_pd(block));
            std::memcpy(outBlocks, temp, blockSize);

            inBlocks = PtrAdd(inBlocks, inIncrement);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            length -= blockSize;
        }
    }

    return length;
}
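
// Usage sketch (illustrative only; the cipher name, callback names and word
// type below are hypothetical, not part of this header). A 64-bit block cipher
// provides an F2 and an F6 with a matching subkey word type and forwards its
// AdvancedProcessBlocks call to the template:
//
//   extern void MyCipher_Enc_2_Blocks(__m128i &block0, __m128i &block1,
//       const word32 *subkeys, unsigned int rounds);
//   extern void MyCipher_Enc_6_Blocks(__m128i &block0, __m128i &block1,
//       __m128i &block2, __m128i &block3, __m128i &block4, __m128i &block5,
//       const word32 *subkeys, unsigned int rounds);
//
//   size_t MyCipher_Enc_AdvancedProcessBlocks_SSE(const word32 *subKeys,
//       size_t rounds, const byte *inBlocks, const byte *xorBlocks,
//       byte *outBlocks, size_t length, word32 flags)
//   {
//       return AdvancedProcessBlocks64_6x2_SSE(MyCipher_Enc_2_Blocks,
//           MyCipher_Enc_6_Blocks, subKeys, rounds, inBlocks, xorBlocks,
//           outBlocks, length, flags);
//   }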

/// \brief AdvancedProcessBlocks for 2 and 6 blocks
/// \tparam F2 function to process 2 128-bit blocks
/// \tparam F6 function to process 6 128-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks128_6x2_SSE processes 6 and 2 SSE SIMD words
///   at a time. For a single block the template uses F2 with a zero block.
/// \details The subkey type is usually word32 or word64. F2 and F6 must use the
///   same word type.
template <typename F2, typename F6, typename W>
inline size_t AdvancedProcessBlocks128_6x2_SSE(F2 func2, F6 func6,
    MAYBE_CONST W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);

    const size_t blockSize = 16;
    // const size_t xmmBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 6*blockSize)
        {
            __m128i block0, block1, block2, block3, block4, block5;
            if (flags & BT_InBlockIsCounter)
            {
                // Increment of 1 in big-endian compatible with the ctr byte array.
                const __m128i s_one = _mm_set_epi32(1<<24, 0, 0, 0);
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                block1 = _mm_add_epi32(block0, s_one);
                block2 = _mm_add_epi32(block1, s_one);
                block3 = _mm_add_epi32(block2, s_one);
                block4 = _mm_add_epi32(block3, s_one);
                block5 = _mm_add_epi32(block4, s_one);
                _mm_storeu_si128(M128_CAST(inBlocks), _mm_add_epi32(block5, s_one));
            }
            else
            {
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block4 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block5 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = _mm_xor_si128(block2, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = _mm_xor_si128(block3, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = _mm_xor_si128(block4, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = _mm_xor_si128(block5, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func6(block0, block1, block2, block3, block4, block5, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = _mm_xor_si128(block2, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = _mm_xor_si128(block3, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = _mm_xor_si128(block4, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = _mm_xor_si128(block5, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            _mm_storeu_si128(M128_CAST(outBlocks), block0);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block1);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block2);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block3);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block4);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block5);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 6*blockSize;
        }

        while (length >= 2*blockSize)
        {
            __m128i block0, block1;
            if (flags & BT_InBlockIsCounter)
            {
                // Increment of 1 in big-endian compatible with the ctr byte array.
                const __m128i s_one = _mm_set_epi32(1<<24, 0, 0, 0);
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                block1 = _mm_add_epi32(block0, s_one);
                _mm_storeu_si128(M128_CAST(inBlocks), _mm_add_epi32(block1, s_one));
            }
            else
            {
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func2(block0, block1, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            _mm_storeu_si128(M128_CAST(outBlocks), block0);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block1);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 2*blockSize;
        }
    }

    while (length >= blockSize)
    {
        __m128i block, zero = _mm_setzero_si128();
        block = _mm_loadu_si128(CONST_M128_CAST(inBlocks));

        if (xorInput)
            block = _mm_xor_si128(block, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func2(block, zero, subKeys, static_cast<unsigned int>(rounds));

        if (xorOutput)
            block = _mm_xor_si128(block, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));

        _mm_storeu_si128(M128_CAST(outBlocks), block);

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}
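
// For reference, the callback shapes this template expects (inferred from how
// func2 and func6 are invoked above; the parameter names are placeholders):
//
//   void F2(__m128i &block0, __m128i &block1,
//       MAYBE_CONST W *subkeys, unsigned int rounds);
//   void F6(__m128i &block0, __m128i &block1, __m128i &block2,
//       __m128i &block3, __m128i &block4, __m128i &block5,
//       MAYBE_CONST W *subkeys, unsigned int rounds);
//
// Each callback transforms its blocks in place with the supplied round keys;
// the template performs the loads, stores, optional XORs and CTR bookkeeping.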

/// \brief AdvancedProcessBlocks for 1 and 4 blocks
/// \tparam F1 function to process 1 128-bit block
/// \tparam F4 function to process 4 128-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks128_4x1_SSE processes 4 and 1 SSE SIMD words
///   at a time.
/// \details The subkey type is usually word32 or word64. F1 and F4 must use the
///   same word type.
template <typename F1, typename F4, typename W>
inline size_t AdvancedProcessBlocks128_4x1_SSE(F1 func1, F4 func4,
    MAYBE_CONST W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);

    const size_t blockSize = 16;
    // const size_t xmmBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 4*blockSize)
        {
            __m128i block0, block1, block2, block3;
            if (flags & BT_InBlockIsCounter)
            {
                // Increment of 1 in big-endian compatible with the ctr byte array.
                const __m128i s_one = _mm_set_epi32(1<<24, 0, 0, 0);
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                block1 = _mm_add_epi32(block0, s_one);
                block2 = _mm_add_epi32(block1, s_one);
                block3 = _mm_add_epi32(block2, s_one);
                _mm_storeu_si128(M128_CAST(inBlocks), _mm_add_epi32(block3, s_one));
            }
            else
            {
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = _mm_xor_si128(block2, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = _mm_xor_si128(block3, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func4(block0, block1, block2, block3, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = _mm_xor_si128(block2, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = _mm_xor_si128(block3, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            _mm_storeu_si128(M128_CAST(outBlocks), block0);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block1);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block2);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block3);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 4*blockSize;
        }
    }

    while (length >= blockSize)
    {
        __m128i block = _mm_loadu_si128(CONST_M128_CAST(inBlocks));

        if (xorInput)
            block = _mm_xor_si128(block, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func1(block, subKeys, static_cast<unsigned int>(rounds));

        if (xorOutput)
            block = _mm_xor_si128(block, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));

        _mm_storeu_si128(M128_CAST(outBlocks), block);

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}
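
// Usage sketch (illustrative only; the cipher and callback names below are
// hypothetical, not part of this header):
//
//   size_t MyCipher_Dec_AdvancedProcessBlocks_SSE(const word64 *subKeys,
//       size_t rounds, const byte *inBlocks, const byte *xorBlocks,
//       byte *outBlocks, size_t length, word32 flags)
//   {
//       return AdvancedProcessBlocks128_4x1_SSE(MyCipher_Dec_Block,
//           MyCipher_Dec_4_Blocks, subKeys, rounds, inBlocks, xorBlocks,
//           outBlocks, length, flags);
//   }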

/// \brief AdvancedProcessBlocks for 1 and 4 blocks
/// \tparam F1 function to process 1 64-bit block
/// \tparam F4 function to process 4 64-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks64_4x1_SSE processes 4 and 1 SSE SIMD words
///   at a time.
/// \details The subkey type is usually word32 or word64. F1 and F4 must use the
///   same word type.
template <typename F1, typename F4, typename W>
inline size_t AdvancedProcessBlocks64_4x1_SSE(F1 func1, F4 func4,
    MAYBE_CONST W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 8);

    const size_t blockSize = 8;
    const size_t xmmBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter | BT_DontIncrementInOutPointers)) ? 0 : xmmBlockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? xmmBlockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : xmmBlockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - xmmBlockSize);
        xorBlocks = PtrAdd(xorBlocks, length - xmmBlockSize);
        outBlocks = PtrAdd(outBlocks, length - xmmBlockSize);
        inIncrement = 0 - inIncrement;
        xorIncrement = 0 - xorIncrement;
        outIncrement = 0 - outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        double temp[2];
        while (length >= 4 * xmmBlockSize)
        {
            __m128i block0, block1, block2, block3;
            if (flags & BT_InBlockIsCounter)
            {
                // Increment of 1 and 2 in big-endian compatible with the ctr byte array.
                const __m128i s_one = _mm_set_epi32(1<<24, 0, 0, 0);
                const __m128i s_two = _mm_set_epi32(2<<24, 0, 2<<24, 0);

                // For 64-bit block ciphers we need to load the CTR block, which is 8 bytes.
                // After the dup load we have two counters in the XMM word. Then we need
                // to increment the low ctr by 0 and the high ctr by 1.
                std::memcpy(temp, inBlocks, blockSize);
                block0 = _mm_add_epi32(s_one, _mm_castpd_si128(_mm_loaddup_pd(temp)));

                // After initial increment of {0,1} remaining counters increment by {2,2}.
                block1 = _mm_add_epi32(s_two, block0);
                block2 = _mm_add_epi32(s_two, block1);
                block3 = _mm_add_epi32(s_two, block2);

                // Store the next counter. When BT_InBlockIsCounter is set then
                // inBlocks is backed by m_counterArray which is non-const.
                _mm_store_sd(temp, _mm_castsi128_pd(_mm_add_epi64(s_two, block3)));
                std::memcpy(const_cast<byte*>(inBlocks), temp, blockSize);
            }
            else
            {
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = _mm_xor_si128(block2, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = _mm_xor_si128(block3, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func4(block0, block1, block2, block3, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = _mm_xor_si128(block2, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = _mm_xor_si128(block3, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            _mm_storeu_si128(M128_CAST(outBlocks), block0);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block1);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block2);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block3);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 4 * xmmBlockSize;
        }
    }

    if (length)
    {
        // Adjust to real block size
        if (flags & BT_ReverseDirection)
        {
            inIncrement += inIncrement ? blockSize : 0;
            xorIncrement += xorIncrement ? blockSize : 0;
            outIncrement += outIncrement ? blockSize : 0;
            inBlocks -= inIncrement;
            xorBlocks -= xorIncrement;
            outBlocks -= outIncrement;
        }
        else
        {
            inIncrement -= inIncrement ? blockSize : 0;
            xorIncrement -= xorIncrement ? blockSize : 0;
            outIncrement -= outIncrement ? blockSize : 0;
        }

        while (length >= blockSize)
        {
            double temp[2];
            std::memcpy(temp, inBlocks, blockSize);
            __m128i block = _mm_castpd_si128(_mm_load_sd(temp));

            if (xorInput)
            {
                std::memcpy(temp, xorBlocks, blockSize);
                block = _mm_xor_si128(block, _mm_castpd_si128(_mm_load_sd(temp)));
            }

            if (flags & BT_InBlockIsCounter)
                const_cast<byte *>(inBlocks)[7]++;

            func1(block, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                std::memcpy(temp, xorBlocks, blockSize);
                block = _mm_xor_si128(block, _mm_castpd_si128(_mm_load_sd(temp)));
            }

            _mm_store_sd(temp, _mm_castsi128_pd(block));
            std::memcpy(outBlocks, temp, blockSize);

            inBlocks = PtrAdd(inBlocks, inIncrement);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            length -= blockSize;
        }
    }

    return length;
}
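
// Calling convention reminder (a sketch, not taken from a particular caller):
// length is a byte count, whole blocks are processed and any unprocessed bytes
// are returned. A single-shot call with no xor buffer might look like
//
//   const word32 flags = BT_AllowParallel;
//   const size_t remaining = AdvancedProcessBlocks64_4x1_SSE(MyCipher_Enc_Block,
//       MyCipher_Enc_4_Blocks, subKeys, rounds, in, NULLPTR, out, length, flags);
//   CRYPTOPP_ASSERT(remaining == 0);
//
// where MyCipher_Enc_Block and MyCipher_Enc_4_Blocks are hypothetical F1 and
// F4 callbacks.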

NAMESPACE_END  // CryptoPP

#endif  // CRYPTOPP_SSSE3_AVAILABLE

// *********************** Altivec/Power 4 ********************** //

#if defined(CRYPTOPP_ALTIVEC_AVAILABLE)

NAMESPACE_BEGIN(CryptoPP)

/// \brief AdvancedProcessBlocks for 1 and 6 blocks
/// \tparam F1 function to process 1 128-bit block
/// \tparam F6 function to process 6 128-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks128_6x1_ALTIVEC processes 6 and 1 Altivec SIMD words
///   at a time.
/// \details The subkey type is usually word32 or word64. F1 and F6 must use the
///   same word type.
template <typename F1, typename F6, typename W>
inline size_t AdvancedProcessBlocks128_6x1_ALTIVEC(F1 func1, F6 func6,
    const W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);

#if defined(CRYPTOPP_LITTLE_ENDIAN)
    const uint32x4_p s_one = {1,0,0,0};
#else
    const uint32x4_p s_one = {0,0,0,1};
#endif

    const size_t blockSize = 16;
    // const size_t vexBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 6*blockSize)
        {
            uint32x4_p block0, block1, block2, block3, block4, block5, temp;

            if (flags & BT_InBlockIsCounter)
            {
                block0 = VectorLoadBE(inBlocks);
                block1 = VectorAdd(block0, s_one);
                block2 = VectorAdd(block1, s_one);
                block3 = VectorAdd(block2, s_one);
                block4 = VectorAdd(block3, s_one);
                block5 = VectorAdd(block4, s_one);
                temp = VectorAdd(block5, s_one);
                VectorStoreBE(temp, const_cast<byte*>(inBlocks));
            }
            else
            {
                block0 = VectorLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = VectorLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = VectorLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = VectorLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block4 = VectorLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block5 = VectorLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = VectorXor(block0, VectorLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = VectorXor(block1, VectorLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = VectorXor(block2, VectorLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = VectorXor(block3, VectorLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = VectorXor(block4, VectorLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = VectorXor(block5, VectorLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func6(block0, block1, block2, block3, block4, block5, subKeys, rounds);

            if (xorOutput)
            {
                block0 = VectorXor(block0, VectorLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = VectorXor(block1, VectorLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = VectorXor(block2, VectorLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = VectorXor(block3, VectorLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = VectorXor(block4, VectorLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = VectorXor(block5, VectorLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            VectorStoreBE(block0, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VectorStoreBE(block1, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VectorStoreBE(block2, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VectorStoreBE(block3, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VectorStoreBE(block4, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VectorStoreBE(block5, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 6*blockSize;
        }
    }

    while (length >= blockSize)
    {
        uint32x4_p block = VectorLoadBE(inBlocks);

        if (xorInput)
            block = VectorXor(block, VectorLoadBE(xorBlocks));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func1(block, subKeys, rounds);

        if (xorOutput)
            block = VectorXor(block, VectorLoadBE(xorBlocks));

        VectorStoreBE(block, outBlocks);

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}
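
// Usage sketch for the Altivec template (illustrative only; the cipher and
// callback names are hypothetical). Unlike the SSE templates the subkey
// pointer is a plain const W*, and rounds is forwarded to the callbacks
// without a cast:
//
//   size_t MyCipher_Enc_AdvancedProcessBlocks_ALTIVEC(const word32 *subKeys,
//       size_t rounds, const byte *inBlocks, const byte *xorBlocks,
//       byte *outBlocks, size_t length, word32 flags)
//   {
//       return AdvancedProcessBlocks128_6x1_ALTIVEC(MyCipher_Enc_Block,
//           MyCipher_Enc_6_Blocks, subKeys, rounds, inBlocks, xorBlocks,
//           outBlocks, length, flags);
//   }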

NAMESPACE_END  // CryptoPP

#endif  // CRYPTOPP_ALTIVEC_AVAILABLE

#endif  // CRYPTOPP_ADVANCED_SIMD_TEMPLATES