The SSSE3 intrinsics were performing aligned loads using _mm_load_si128 using user supplied pointers. The pointers are only a byte pointer, so its alignment can drop to 1 or 2. Switching to _mm_loadu_si128 will sidestep potential problems. The crash surfaced under Win32 testing.
Switch to memcpy's when performing bulk assignment x[0]=y[0] ... x[3]=y[3]. I believe Yun used the pattern to promote vectorization. Some compilers appear to be braindead and issue integer move's one word at a time. Non-braindead compiler will still take the optimization when advantageous, and slower compilers will benefit from the bulk move. We also cherry picked vectorization opportunities, like in ARIA_GSRK_NEON.
Remove keyBits variable. We now use UncheckedSetKey's keylen throughout.
Also fix a typo in CRYPTOPP_BOOL_SSSE3_INTRINSICS_AVAILABLE. __SSSE3__ was listed twice.
Win32 and Win64 benefited from the Intel intrinsics. A32 and Aarch64 benefited from the ARM intrinsics. The intrinsics shaved 150 to 350 cycles from key setup.
The intrinsics slowed modern GCC down a small bit, and did not appear to affect old GCC. As such, Intel intrinsics were only enabled for Microsoft compilers.
We were not able to improve encryption and decryption. In fact, some of the attempted macro conversions and intrinsics attempts slowed things down considerably. For example, GCC 5.4 on x86_64 went from 120 MB/s to about 70 MB/s when we tried to improve code around the Key XOR Layer (ARIA_KXL).
This is the reference implementation, test data and test vectors from the ARIA.zip package on the KISA website. The website is located at http://seed.kisa.or.kr/iwt/ko/bbs/EgovReferenceList.do?bbsId=BBSMSTR_000000000002.
We have optimized routines that improve Key Setup and Bulk Encryption performance, but they are not being checked-in at the moment. The ARIA team is updating its implementation for contemporary hardware and we would like to use it as a starting point before we wander too far away from the KISA implementation.
Couple use of initialization priorities to no NO_OS_DEPENDENCE
Add comments explaining what integer does, how it does it, and why we want to inprove on the Singleton pattern as a resource manager.
Update documentation.
This effectively decouples Integer and Public Key from the rest of the library. The change means a compile time define is used rather than a runtime pointer. It avoids the race with Issue 389.
The Public Key algorithms will fail if you use them. For example, running the self tests with CRYPTOPP_NO_ASSIGN_TO_INTEGER in effect results in "CryptoPP::Exception caught: NameValuePairs: type mismatch for 'EquivalentTo', stored 'i', trying to retrieve 'N8CryptoPP7IntegerE'". The exception is expected, and the same happend when g_pAssignIntToInteger was present.
This check-in also enables the 64-bit RDRAND routines for X32. The changes were with held until they could be tested. The testing occurred with Issue 395
Wrap DetectArmFeatures and DetectX86Features in InitializeCpu class
Use init_priority for InitializeCpu
Remove HAVE_GCC_CONSTRUCTOR1 and HAVE_GCC_CONSTRUCTOR0
Use init_seg(<name>) on Windows and explicitly insert at XCU segment
Simplify logic for HAVE_GAS
Remove special recipies for MACPORTS_GCC_COMPILER
Move C++ static initializers into anonymous namespace when possible
Add default NullNameValuePairs ctor for Clang
When MSVC init_seg or GCC init_priority is available, we don't need to use the Singleton. We only need to create a file scope class variable and place it in the segment for MSVC or provide the attribute for GCC.
An additional upside is we cleared all the memory leaks that used to be reported by MSVC for debug builds.
StringWiden converts a narrow C-style string to a wide string. It serves the opposite role of StringNarrow function. The function is useful on Windows platforms where the OS favors wide functions with the UTF-16 character set. For example, the Data Proction API (DPAPI) allows a description, but its a wide character C-string. There is no narrwo version of the API.
Port rdrand.S to Solaris
Port rdrand.S to X32
The X32 port is responsible for the loop unwinding. The unwind generates a 32-byte block (X64 and X32) or 16-byte block (X86). On X32, it increases throughut by 100% (doubles it). On X86 and X64, throughput increases by about 6%. Anything over 4 machine words slows things down.
Port rdrand.S to Cygwin and OS X
Add DISABLE_NATIVE_ARCH to CmakefileList and GNUmakefile. It supresses the addition of -march=native. DISABLE_NATIVE_ARCH replaces DISABLE_CXXFLAGS_OPTIMIZATIONS in CmakefileList (the latter is now deprecated).