// adv_simd.h - written and placed in the public domain by Jeffrey Walton

/// \file adv_simd.h
/// \brief Template for AdvancedProcessBlocks and SIMD processing

// The SIMD based implementations for ciphers that use SSE, NEON and Power7
// have a common pattern. Namely, they have a specialized implementation of
// AdvancedProcessBlocks which processes multiple blocks using hardware
// acceleration. After several implementations we noticed a lot of copy and
// paste occurring. adv_simd.h provides a template to avoid the copy and paste.
//
// There are 7 templates provided in this file. The number following the
// function name, 128, is the block size in bits. The name following the
// block size is the arrangement and acceleration. For example 4x1_SSE means
// Intel SSE using two encrypt (or decrypt) functions: one that operates on
// 4 SIMD words, and one that operates on 1 SIMD word.
//
//   * AdvancedProcessBlocks128_4x1_SSE
//   * AdvancedProcessBlocks128_6x2_SSE
//   * AdvancedProcessBlocks128_4x1_NEON
//   * AdvancedProcessBlocks128_6x1_NEON
//   * AdvancedProcessBlocks128_6x2_NEON
//   * AdvancedProcessBlocks128_4x1_ALTIVEC
//   * AdvancedProcessBlocks128_6x1_ALTIVEC
//
// If an arrangement ends in 2, like 6x2, then the template will handle the
// single block case by padding with 0's and using the two SIMD word
// function. This happens at most one time when processing multiple blocks.
// The extra processing of a zero block is trivial and worth the tradeoff.
//
// The MAYBE_CONST macro present on x86 is a SunCC workaround. Some versions
// of SunCC lose/drop the const-ness in the F1 and F4 functions. It eventually
// results in a failed link due to the const/non-const mismatch.
//
// In July 2020 the library stopped using the 64-bit block version of
// AdvancedProcessBlocks. Testing showed unreliable results and failed
// self tests on occasion. Also see Issue 945 and
// https://github.com/weidai11/cryptopp/commit/dd7598e638bb.
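//
// A rough usage sketch (the cipher and function names are hypothetical and
// shown only for illustration): a cipher's SIMD source file defines the
// per-block worker functions and forwards its AdvancedProcessBlocks call to
// one of the templates below.
//
//   size_t MyCipher_AdvancedProcessBlocks_SSSE3(MAYBE_CONST word32 *subKeys,
//       size_t rounds, const byte *inBlocks, const byte *xorBlocks,
//       byte *outBlocks, size_t length, word32 flags)
//   {
//       return AdvancedProcessBlocks128_6x2_SSE(MyCipher_Enc_2_Blocks,
//           MyCipher_Enc_6_Blocks, subKeys, rounds, inBlocks, xorBlocks,
//           outBlocks, length, flags);
//   }
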
#ifndef CRYPTOPP_ADVANCED_SIMD_TEMPLATES
#define CRYPTOPP_ADVANCED_SIMD_TEMPLATES

#include "config.h"
#include "misc.h"
#include "stdcpp.h"

#if (CRYPTOPP_ARM_NEON_HEADER)
# include <arm_neon.h>
#endif

#if (CRYPTOPP_ARM_ACLE_HEADER)
# include <stdint.h>
# include <arm_acle.h>
#endif

#if (CRYPTOPP_SSE2_INTRIN_AVAILABLE)
# include <emmintrin.h>
# include <xmmintrin.h>
#endif

// SunCC needs CRYPTOPP_SSSE3_AVAILABLE, too
#if (CRYPTOPP_SSSE3_AVAILABLE)
# include <emmintrin.h>
# include <pmmintrin.h>
# include <xmmintrin.h>
#endif

#if defined(__ALTIVEC__)
# include "ppc_simd.h"
#endif

// ************************ All block ciphers *********************** //

ANONYMOUS_NAMESPACE_BEGIN

using CryptoPP::BlockTransformation;

CRYPTOPP_CONSTANT(BT_XorInput = BlockTransformation::BT_XorInput);
CRYPTOPP_CONSTANT(BT_AllowParallel = BlockTransformation::BT_AllowParallel);
CRYPTOPP_CONSTANT(BT_InBlockIsCounter = BlockTransformation::BT_InBlockIsCounter);
CRYPTOPP_CONSTANT(BT_ReverseDirection = BlockTransformation::BT_ReverseDirection);
CRYPTOPP_CONSTANT(BT_DontIncrementInOutPointers = BlockTransformation::BT_DontIncrementInOutPointers);

ANONYMOUS_NAMESPACE_END

// *************************** ARM NEON ************************** //

#if (CRYPTOPP_ARM_NEON_AVAILABLE) || (CRYPTOPP_ARM_ASIMD_AVAILABLE) || \
    defined(CRYPTOPP_DOXYGEN_PROCESSING)

NAMESPACE_BEGIN(CryptoPP)

/// \brief AdvancedProcessBlocks for 1 and 6 blocks
/// \tparam F1 function to process 1 128-bit block
/// \tparam F6 function to process 6 128-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks128_6x1_NEON processes 6 and 1 NEON SIMD words
///  at a time.
/// \details The subkey type is usually word32 or word64. F1 and F6 must use the
///  same word type.
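/// \details A minimal sketch of the callables this template expects, inferred
///  from the call sites in the function body (parameter names are illustrative):
///  <tt>void F1(uint64x2_t &block, const W *subKeys, unsigned int rounds)</tt> and
///  <tt>void F6(uint64x2_t &block0, ..., uint64x2_t &block5, const W *subKeys,
///  unsigned int rounds)</tt>.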
template <typename F1, typename F6, typename W>
inline size_t AdvancedProcessBlocks128_6x1_NEON(F1 func1, F6 func6,
    const W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);

    const unsigned int w_one[] = {0, 0<<24, 0, 1<<24};
    const uint32x4_t s_one = vld1q_u32(w_one);

    const size_t blockSize = 16;
    // const size_t neonBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 6*blockSize)
        {
            uint64x2_t block0, block1, block2, block3, block4, block5;
            if (flags & BT_InBlockIsCounter)
            {
                const uint64x2_t one = vreinterpretq_u64_u32(s_one);
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                block1 = vaddq_u64(block0, one);
                block2 = vaddq_u64(block1, one);
                block3 = vaddq_u64(block2, one);
                block4 = vaddq_u64(block3, one);
                block5 = vaddq_u64(block4, one);
                vst1q_u8(const_cast<byte*>(inBlocks),
                    vreinterpretq_u8_u64(vaddq_u64(block5, one)));
            }
            else
            {
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block4 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block5 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u64(block2, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u64(block3, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = veorq_u64(block4, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = veorq_u64(block5, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func6(block0, block1, block2, block3, block4, block5, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u64(block2, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u64(block3, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = veorq_u64(block4, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = veorq_u64(block5, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block0));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block1));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block2));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block3));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block4));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block5));
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 6*blockSize;
        }
    }

    while (length >= blockSize)
    {
        uint64x2_t block;
        block = vreinterpretq_u64_u8(vld1q_u8(inBlocks));

        if (xorInput)
            block = veorq_u64(block, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func1(block, subKeys, static_cast<unsigned int>(rounds));

        if (xorOutput)
            block = veorq_u64(block, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));

        vst1q_u8(outBlocks, vreinterpretq_u8_u64(block));

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}

/// \brief AdvancedProcessBlocks for 1 and 4 blocks
/// \tparam F1 function to process 1 128-bit block
/// \tparam F4 function to process 4 128-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks128_4x1_NEON processes 4 and 1 NEON SIMD words
///  at a time.
/// \details The subkey type is usually word32 or word64. V is the vector type and it is
///  usually uint32x4_t or uint64x2_t. F1, F4, and W must use the same word and
///  vector type.
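/// \details A minimal sketch of the expected callables, inferred from the call
///  sites in the function body: <tt>void F1(uint32x4_t &block, const W *subKeys,
///  unsigned int rounds)</tt> and <tt>void F4(uint32x4_t &block0, ..., uint32x4_t &block3,
///  const W *subKeys, unsigned int rounds)</tt>.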
template <typename F1, typename F4, typename W>
inline size_t AdvancedProcessBlocks128_4x1_NEON(F1 func1, F4 func4,
    const W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);

    const unsigned int w_one[] = {0, 0<<24, 0, 1<<24};
    const uint32x4_t s_one = vld1q_u32(w_one);

    const size_t blockSize = 16;
    // const size_t neonBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 4*blockSize)
        {
            uint32x4_t block0, block1, block2, block3;
            if (flags & BT_InBlockIsCounter)
            {
                const uint32x4_t one = s_one;
                block0 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                block1 = vreinterpretq_u32_u64(vaddq_u64(vreinterpretq_u64_u32(block0), vreinterpretq_u64_u32(one)));
                block2 = vreinterpretq_u32_u64(vaddq_u64(vreinterpretq_u64_u32(block1), vreinterpretq_u64_u32(one)));
                block3 = vreinterpretq_u32_u64(vaddq_u64(vreinterpretq_u64_u32(block2), vreinterpretq_u64_u32(one)));
                vst1q_u8(const_cast<byte*>(inBlocks), vreinterpretq_u8_u64(vaddq_u64(
                    vreinterpretq_u64_u32(block3), vreinterpretq_u64_u32(one))));
            }
            else
            {
                block0 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = vreinterpretq_u32_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = veorq_u32(block0, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u32(block1, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u32(block2, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u32(block3, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func4(block0, block1, block2, block3, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = veorq_u32(block0, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u32(block1, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u32(block2, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u32(block3, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            vst1q_u8(outBlocks, vreinterpretq_u8_u32(block0));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u32(block1));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u32(block2));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u32(block3));
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 4*blockSize;
        }
    }

    while (length >= blockSize)
    {
        uint32x4_t block = vreinterpretq_u32_u8(vld1q_u8(inBlocks));

        if (xorInput)
            block = veorq_u32(block, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func1(block, subKeys, static_cast<unsigned int>(rounds));

        if (xorOutput)
            block = veorq_u32(block, vreinterpretq_u32_u8(vld1q_u8(xorBlocks)));

        vst1q_u8(outBlocks, vreinterpretq_u8_u32(block));

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}

/// \brief AdvancedProcessBlocks for 2 and 6 blocks
/// \tparam F2 function to process 2 128-bit blocks
/// \tparam F6 function to process 6 128-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks128_6x2_NEON processes 6 and 2 NEON SIMD words
///  at a time. For a single block the template uses F2 with a zero block.
/// \details The subkey type is usually word32 or word64. F2 and F6 must use the
///  same word type.
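/// \details A minimal sketch of the single-block path, inferred from the call
///  sites in the function body: the template loads one block, pairs it with a
///  zero block, and calls <tt>F2(block, zero, subKeys, rounds)</tt>, where F2 has a
///  signature along the lines of <tt>void F2(uint64x2_t &block0, uint64x2_t &block1,
///  const W *subKeys, unsigned int rounds)</tt>.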
template <typename F2, typename F6, typename W>
inline size_t AdvancedProcessBlocks128_6x2_NEON(F2 func2, F6 func6,
    const W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);

    const unsigned int w_one[] = {0, 0<<24, 0, 1<<24};
    const uint32x4_t s_one = vld1q_u32(w_one);

    const size_t blockSize = 16;
    // const size_t neonBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 6*blockSize)
        {
            uint64x2_t block0, block1, block2, block3, block4, block5;
            if (flags & BT_InBlockIsCounter)
            {
                const uint64x2_t one = vreinterpretq_u64_u32(s_one);
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                block1 = vaddq_u64(block0, one);
                block2 = vaddq_u64(block1, one);
                block3 = vaddq_u64(block2, one);
                block4 = vaddq_u64(block3, one);
                block5 = vaddq_u64(block4, one);
                vst1q_u8(const_cast<byte*>(inBlocks),
                    vreinterpretq_u8_u64(vaddq_u64(block5, one)));
            }
            else
            {
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block4 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block5 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u64(block2, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u64(block3, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = veorq_u64(block4, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = veorq_u64(block5, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func6(block0, block1, block2, block3, block4, block5, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = veorq_u64(block2, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = veorq_u64(block3, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = veorq_u64(block4, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = veorq_u64(block5, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block0));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block1));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block2));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block3));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block4));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block5));
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 6*blockSize;
        }

        while (length >= 2*blockSize)
        {
            uint64x2_t block0, block1;
            if (flags & BT_InBlockIsCounter)
            {
                const uint64x2_t one = vreinterpretq_u64_u32(s_one);
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                block1 = vaddq_u64(block0, one);
                vst1q_u8(const_cast<byte*>(inBlocks),
                    vreinterpretq_u8_u64(vaddq_u64(block1, one)));
            }
            else
            {
                block0 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = vreinterpretq_u64_u8(vld1q_u8(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func2(block0, block1, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = veorq_u64(block0, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = veorq_u64(block1, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block0));
            outBlocks = PtrAdd(outBlocks, outIncrement);
            vst1q_u8(outBlocks, vreinterpretq_u8_u64(block1));
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 2*blockSize;
        }
    }

    while (length >= blockSize)
    {
        uint64x2_t block, zero = {0,0};
        block = vreinterpretq_u64_u8(vld1q_u8(inBlocks));

        if (xorInput)
            block = veorq_u64(block, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func2(block, zero, subKeys, static_cast<unsigned int>(rounds));

        if (xorOutput)
            block = veorq_u64(block, vreinterpretq_u64_u8(vld1q_u8(xorBlocks)));

        vst1q_u8(outBlocks, vreinterpretq_u8_u64(block));

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}

NAMESPACE_END // CryptoPP

#endif // CRYPTOPP_ARM_NEON_AVAILABLE

// *************************** Intel SSE ************************** //

#if defined(CRYPTOPP_SSSE3_AVAILABLE) || defined(CRYPTOPP_DOXYGEN_PROCESSING)

#if defined(CRYPTOPP_DOXYGEN_PROCESSING)
/// \brief SunCC workaround
/// \details SunCC loses the const on AES_Enc_Block and AES_Dec_Block
/// \sa <A HREF="http://github.com/weidai11/cryptopp/issues/224">Issue
///  224, SunCC and failed compile for rijndael.cpp</A>
# define MAYBE_CONST const
/// \brief SunCC workaround
/// \details SunCC loses the const on AES_Enc_Block and AES_Dec_Block
/// \sa <A HREF="http://github.com/weidai11/cryptopp/issues/224">Issue
///  224, SunCC and failed compile for rijndael.cpp</A>
# define MAYBE_UNCONST_CAST(T, x) (x)
#elif (__SUNPRO_CC >= 0x5130)
# define MAYBE_CONST
# define MAYBE_UNCONST_CAST(T, x) const_cast<MAYBE_CONST T>(x)
#else
# define MAYBE_CONST const
# define MAYBE_UNCONST_CAST(T, x) (x)
#endif

#if defined(CRYPTOPP_DOXYGEN_PROCESSING)
/// \brief Clang workaround
/// \details Clang issues spurious alignment warnings
/// \sa <A HREF="http://bugs.llvm.org/show_bug.cgi?id=20670">Issue
///  20670, _mm_loadu_si128 parameter has wrong type</A>
# define M128_CAST(x) ((__m128i *)(void *)(x))
/// \brief Clang workaround
/// \details Clang issues spurious alignment warnings
/// \sa <A HREF="http://bugs.llvm.org/show_bug.cgi?id=20670">Issue
///  20670, _mm_loadu_si128 parameter has wrong type</A>
# define CONST_M128_CAST(x) ((const __m128i *)(const void *)(x))
#else
# ifndef M128_CAST
#  define M128_CAST(x) ((__m128i *)(void *)(x))
# endif
# ifndef CONST_M128_CAST
#  define CONST_M128_CAST(x) ((const __m128i *)(const void *)(x))
# endif
#endif

NAMESPACE_BEGIN(CryptoPP)

/// \brief AdvancedProcessBlocks for 2 and 6 blocks
/// \tparam F2 function to process 2 128-bit blocks
/// \tparam F6 function to process 6 128-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks128_6x2_SSE processes 6 and 2 SSE SIMD words
///  at a time. For a single block the template uses F2 with a zero block.
/// \details The subkey type is usually word32 or word64. F2 and F6 must use the
///  same word type.
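/// \details A minimal sketch of the expected callables, inferred from the call
///  sites in the function body: <tt>void F2(__m128i &block0, __m128i &block1,
///  MAYBE_CONST W *subKeys, unsigned int rounds)</tt>, and F6 takes six
///  <tt>__m128i</tt> blocks followed by the same subkey and round parameters.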
template <typename F2, typename F6, typename W>
inline size_t AdvancedProcessBlocks128_6x2_SSE(F2 func2, F6 func6,
    MAYBE_CONST W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);

    const size_t blockSize = 16;
    // const size_t xmmBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 6*blockSize)
        {
            __m128i block0, block1, block2, block3, block4, block5;
            if (flags & BT_InBlockIsCounter)
            {
                // Increment of 1 in big-endian compatible with the ctr byte array.
                const __m128i s_one = _mm_set_epi32(1<<24, 0, 0, 0);
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                block1 = _mm_add_epi32(block0, s_one);
                block2 = _mm_add_epi32(block1, s_one);
                block3 = _mm_add_epi32(block2, s_one);
                block4 = _mm_add_epi32(block3, s_one);
                block5 = _mm_add_epi32(block4, s_one);
                _mm_storeu_si128(M128_CAST(inBlocks), _mm_add_epi32(block5, s_one));
            }
            else
            {
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block4 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block5 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = _mm_xor_si128(block2, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = _mm_xor_si128(block3, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = _mm_xor_si128(block4, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = _mm_xor_si128(block5, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func6(block0, block1, block2, block3, block4, block5, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = _mm_xor_si128(block2, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = _mm_xor_si128(block3, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = _mm_xor_si128(block4, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = _mm_xor_si128(block5, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            _mm_storeu_si128(M128_CAST(outBlocks), block0);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block1);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block2);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block3);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block4);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block5);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 6*blockSize;
        }

        while (length >= 2*blockSize)
        {
            __m128i block0, block1;
            if (flags & BT_InBlockIsCounter)
            {
                // Increment of 1 in big-endian compatible with the ctr byte array.
                const __m128i s_one = _mm_set_epi32(1<<24, 0, 0, 0);
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                block1 = _mm_add_epi32(block0, s_one);
                _mm_storeu_si128(M128_CAST(inBlocks), _mm_add_epi32(block1, s_one));
            }
            else
            {
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func2(block0, block1, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            _mm_storeu_si128(M128_CAST(outBlocks), block0);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block1);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 2*blockSize;
        }
    }

    while (length >= blockSize)
    {
        __m128i block, zero = _mm_setzero_si128();
        block = _mm_loadu_si128(CONST_M128_CAST(inBlocks));

        if (xorInput)
            block = _mm_xor_si128(block, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func2(block, zero, subKeys, static_cast<unsigned int>(rounds));

        if (xorOutput)
            block = _mm_xor_si128(block, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));

        _mm_storeu_si128(M128_CAST(outBlocks), block);

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}

/// \brief AdvancedProcessBlocks for 1 and 4 blocks
/// \tparam F1 function to process 1 128-bit block
/// \tparam F4 function to process 4 128-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks128_4x1_SSE processes 4 and 1 SSE SIMD words
///  at a time.
/// \details The subkey type is usually word32 or word64. F1 and F4 must use the
///  same word type.
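/// \details The expected callables mirror the 6x2 SSE variant above but use one
///  and four <tt>__m128i</tt> blocks, e.g. <tt>void F1(__m128i &block,
///  MAYBE_CONST W *subKeys, unsigned int rounds)</tt> (a sketch inferred from the
///  call sites in the function body).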
template <typename F1, typename F4, typename W>
inline size_t AdvancedProcessBlocks128_4x1_SSE(F1 func1, F4 func4,
    MAYBE_CONST W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);

    const size_t blockSize = 16;
    // const size_t xmmBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 4*blockSize)
        {
            __m128i block0, block1, block2, block3;
            if (flags & BT_InBlockIsCounter)
            {
                // Increment of 1 in big-endian compatible with the ctr byte array.
                const __m128i s_one = _mm_set_epi32(1<<24, 0, 0, 0);
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                block1 = _mm_add_epi32(block0, s_one);
                block2 = _mm_add_epi32(block1, s_one);
                block3 = _mm_add_epi32(block2, s_one);
                _mm_storeu_si128(M128_CAST(inBlocks), _mm_add_epi32(block3, s_one));
            }
            else
            {
                block0 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = _mm_loadu_si128(CONST_M128_CAST(inBlocks));
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = _mm_xor_si128(block2, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = _mm_xor_si128(block3, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func4(block0, block1, block2, block3, subKeys, static_cast<unsigned int>(rounds));

            if (xorOutput)
            {
                block0 = _mm_xor_si128(block0, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = _mm_xor_si128(block1, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = _mm_xor_si128(block2, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = _mm_xor_si128(block3, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            _mm_storeu_si128(M128_CAST(outBlocks), block0);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block1);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block2);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            _mm_storeu_si128(M128_CAST(outBlocks), block3);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 4*blockSize;
        }
    }

    while (length >= blockSize)
    {
        __m128i block = _mm_loadu_si128(CONST_M128_CAST(inBlocks));

        if (xorInput)
            block = _mm_xor_si128(block, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func1(block, subKeys, static_cast<unsigned int>(rounds));

        if (xorOutput)
            block = _mm_xor_si128(block, _mm_loadu_si128(CONST_M128_CAST(xorBlocks)));

        _mm_storeu_si128(M128_CAST(outBlocks), block);

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}
|
|
|
|
|
2017-12-28 06:16:17 +00:00
|
|
|
NAMESPACE_END // CryptoPP
|
2017-12-10 02:04:25 +00:00
|
|
|
|
|
|
|
#endif // CRYPTOPP_SSSE3_AVAILABLE
|
|
|
|
|
2019-10-29 14:16:11 +00:00
|
|
|

// ************************** Altivec/Power 4 ************************** //

#if defined(__ALTIVEC__) || defined(CRYPTOPP_DOXYGEN_PROCESSING)

NAMESPACE_BEGIN(CryptoPP)
/// \brief AdvancedProcessBlocks for 1 and 4 blocks
/// \tparam F1 function to process 1 128-bit block
/// \tparam F4 function to process 4 128-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks128_4x1_ALTIVEC processes 4 and 1 Altivec SIMD words
///  at a time.
/// \details The subkey type is usually word32 or word64. F1 and F4 must use the
///  same word type.
template <typename F1, typename F4, typename W>
inline size_t AdvancedProcessBlocks128_4x1_ALTIVEC(F1 func1, F4 func4,
    const W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);

#if (CRYPTOPP_LITTLE_ENDIAN)
    const uint32x4_p s_one = {1,0,0,0};
#else
    const uint32x4_p s_one = {0,0,0,1};
#endif

    const size_t blockSize = 16;
    // const size_t simdBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 4*blockSize)
        {
            uint32x4_p block0, block1, block2, block3;

            if (flags & BT_InBlockIsCounter)
            {
                block0 = VecLoadBE(inBlocks);
                block1 = VecAdd(block0, s_one);
                block2 = VecAdd(block1, s_one);
                block3 = VecAdd(block2, s_one);

                // Hack due to big-endian loads used by POWER8 (and maybe ARM-BE).
                // CTR_ModePolicy::OperateKeystream is wired such that after
                // returning from this function CTR_ModePolicy will detect wrap
                // on the last counter byte and increment the next to last byte.
                // The problem is, with a big-endian load, inBlocks[15] is really
                // located at index 15. The vector addition using a 32-bit element
                // generates a carry into inBlocks[14] and then CTR_ModePolicy
                // increments inBlocks[14] too.
                //
                // This loop consumes 4 blocks per iteration, so advance the
                // caller's counter block by 4.
                const_cast<byte*>(inBlocks)[15] += 4;
            }
            else
            {
                block0 = VecLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = VecLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = VecLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = VecLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = VecXor(block0, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = VecXor(block1, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = VecXor(block2, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = VecXor(block3, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func4(block0, block1, block2, block3, subKeys, rounds);

            if (xorOutput)
            {
                block0 = VecXor(block0, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = VecXor(block1, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = VecXor(block2, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = VecXor(block3, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            VecStoreBE(block0, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VecStoreBE(block1, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VecStoreBE(block2, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VecStoreBE(block3, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 4*blockSize;
        }
    }

    while (length >= blockSize)
    {
        uint32x4_p block = VecLoadBE(inBlocks);

        if (xorInput)
            block = VecXor(block, VecLoadBE(xorBlocks));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func1(block, subKeys, rounds);

        if (xorOutput)
            block = VecXor(block, VecLoadBE(xorBlocks));

        VecStoreBE(block, outBlocks);

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}
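
// A minimal usage sketch for the template above, assuming a hypothetical
// cipher "XXX" (the XXX_Enc_Block and XXX_Enc_4_Blocks functions and the
// word32 subkey table are illustrative, not part of this header). A cipher's
// SIMD source file typically wraps the template in a small dispatch function:
//
//   size_t XXX_Enc_AdvancedProcessBlocks_ALTIVEC(const word32* subKeys,
//       size_t rounds, const byte *inBlocks, const byte *xorBlocks,
//       byte *outBlocks, size_t length, word32 flags)
//   {
//       return AdvancedProcessBlocks128_4x1_ALTIVEC(XXX_Enc_Block,
//           XXX_Enc_4_Blocks, subKeys, rounds, inBlocks, xorBlocks,
//           outBlocks, length, flags);
//   }
//
// Here F1 and F4 take their uint32x4_p block arguments by non-const
// reference, followed by the subkey pointer and the round count, and they
// transform the blocks in place. Both must use the same subkey word type W
// (word32 in this sketch).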

/// \brief AdvancedProcessBlocks for 1 and 6 blocks
/// \tparam F1 function to process 1 128-bit block
/// \tparam F6 function to process 6 128-bit blocks
/// \tparam W word type of the subkey table
/// \details AdvancedProcessBlocks128_6x1_ALTIVEC processes 6 and 1 Altivec SIMD words
///  at a time.
/// \details The subkey type is usually word32 or word64. F1 and F6 must use the
///  same word type.
template <typename F1, typename F6, typename W>
inline size_t AdvancedProcessBlocks128_6x1_ALTIVEC(F1 func1, F6 func6,
    const W *subKeys, size_t rounds, const byte *inBlocks,
    const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    CRYPTOPP_ASSERT(subKeys);
    CRYPTOPP_ASSERT(inBlocks);
    CRYPTOPP_ASSERT(outBlocks);
    CRYPTOPP_ASSERT(length >= 16);

#if (CRYPTOPP_LITTLE_ENDIAN)
    const uint32x4_p s_one = {1,0,0,0};
#else
    const uint32x4_p s_one = {0,0,0,1};
#endif

    const size_t blockSize = 16;
    // const size_t simdBlockSize = 16;

    size_t inIncrement = (flags & (BT_InBlockIsCounter|BT_DontIncrementInOutPointers)) ? 0 : blockSize;
    size_t xorIncrement = (xorBlocks != NULLPTR) ? blockSize : 0;
    size_t outIncrement = (flags & BT_DontIncrementInOutPointers) ? 0 : blockSize;

    // Clang and Coverity are generating findings using xorBlocks as a flag.
    const bool xorInput = (xorBlocks != NULLPTR) && (flags & BT_XorInput);
    const bool xorOutput = (xorBlocks != NULLPTR) && !(flags & BT_XorInput);

    if (flags & BT_ReverseDirection)
    {
        inBlocks = PtrAdd(inBlocks, length - blockSize);
        xorBlocks = PtrAdd(xorBlocks, length - blockSize);
        outBlocks = PtrAdd(outBlocks, length - blockSize);
        inIncrement = 0-inIncrement;
        xorIncrement = 0-xorIncrement;
        outIncrement = 0-outIncrement;
    }

    if (flags & BT_AllowParallel)
    {
        while (length >= 6*blockSize)
        {
            uint32x4_p block0, block1, block2, block3, block4, block5;

            if (flags & BT_InBlockIsCounter)
            {
                block0 = VecLoadBE(inBlocks);
                block1 = VecAdd(block0, s_one);
                block2 = VecAdd(block1, s_one);
                block3 = VecAdd(block2, s_one);
                block4 = VecAdd(block3, s_one);
                block5 = VecAdd(block4, s_one);

                // Hack due to big-endian loads used by POWER8 (and maybe ARM-BE).
                // CTR_ModePolicy::OperateKeystream is wired such that after
                // returning from this function CTR_ModePolicy will detect wrap
                // on the last counter byte and increment the next to last byte.
                // The problem is, with a big-endian load, inBlocks[15] is really
                // located at index 15. The vector addition using a 32-bit element
                // generates a carry into inBlocks[14] and then CTR_ModePolicy
                // increments inBlocks[14] too.
                //
                // To find this bug we needed a test case with a ctr of 0xNN...FA.
                // The last octet is 0xFA and adding 6 creates the wrap to trigger
                // the issue. If the last octet was 0xFC then 4 would trigger it.
                // We dumb-lucked into the test with SPECK-128. The test case of
                // interest is the one with IV 348ECA9766C09F04 826520DE47A212FA.
                uint8x16_p temp = VecAdd((uint8x16_p)block5, (uint8x16_p)s_one);
                VecStoreBE(temp, const_cast<byte*>(inBlocks));
            }
            else
            {
                block0 = VecLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block1 = VecLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block2 = VecLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block3 = VecLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block4 = VecLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
                block5 = VecLoadBE(inBlocks);
                inBlocks = PtrAdd(inBlocks, inIncrement);
            }

            if (xorInput)
            {
                block0 = VecXor(block0, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = VecXor(block1, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = VecXor(block2, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = VecXor(block3, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = VecXor(block4, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = VecXor(block5, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            func6(block0, block1, block2, block3, block4, block5, subKeys, rounds);

            if (xorOutput)
            {
                block0 = VecXor(block0, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block1 = VecXor(block1, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block2 = VecXor(block2, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block3 = VecXor(block3, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block4 = VecXor(block4, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
                block5 = VecXor(block5, VecLoadBE(xorBlocks));
                xorBlocks = PtrAdd(xorBlocks, xorIncrement);
            }

            VecStoreBE(block0, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VecStoreBE(block1, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VecStoreBE(block2, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VecStoreBE(block3, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VecStoreBE(block4, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);
            VecStoreBE(block5, outBlocks);
            outBlocks = PtrAdd(outBlocks, outIncrement);

            length -= 6*blockSize;
        }
    }

    while (length >= blockSize)
    {
        uint32x4_p block = VecLoadBE(inBlocks);

        if (xorInput)
            block = VecXor(block, VecLoadBE(xorBlocks));

        if (flags & BT_InBlockIsCounter)
            const_cast<byte *>(inBlocks)[15]++;

        func1(block, subKeys, rounds);

        if (xorOutput)
            block = VecXor(block, VecLoadBE(xorBlocks));

        VecStoreBE(block, outBlocks);

        inBlocks = PtrAdd(inBlocks, inIncrement);
        outBlocks = PtrAdd(outBlocks, outIncrement);
        xorBlocks = PtrAdd(xorBlocks, xorIncrement);
        length -= blockSize;
    }

    return length;
}
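
// A similar hedged sketch for the 6x1 arrangement (hypothetical cipher "XXX";
// the block functions and word64 subkey table are illustrative only):
//
//   size_t XXX_Dec_AdvancedProcessBlocks_ALTIVEC(const word64* subKeys,
//       size_t rounds, const byte *inBlocks, const byte *xorBlocks,
//       byte *outBlocks, size_t length, word32 flags)
//   {
//       return AdvancedProcessBlocks128_6x1_ALTIVEC(XXX_Dec_Block,
//           XXX_Dec_6_Blocks, subKeys, rounds, inBlocks, xorBlocks,
//           outBlocks, length, flags);
//   }
//
// F1 and F6 receive their uint32x4_p blocks by reference and must use the
// same subkey word type W (word64 here). Any bytes left unprocessed are
// reported through the returned length.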

NAMESPACE_END  // CryptoPP

#endif  // __ALTIVEC__

#endif  // CRYPTOPP_ADVANCED_SIMD_TEMPLATES