llvm/test at 32de7d791ec8107be63b306c1b501f2f07d313ee - llvm

RPCSX/llvm

mirror of https://github.com/RPCSX/llvm.git synced 2025-02-10 06:24:58 +00:00


Chandler Carruth fa68750e54 [x86] Unify the horizontal adding used for popcount lowering, taking the best approach of each.

For vNi16, we use the SHL + ADD + SRL pattern, which seems easily the best.

For vNi32, we use the PUNPCK + PSADBW + PACKUSWB pattern. In some cases
there is a huge improvement with this in IACA's estimated throughput --
over 2x higher throughput!!!! -- but the measurements are too good to be
true. In one narrow case, the SHL + ADD + SHL + ADD + SRL pattern looks
slightly faster, but I'm not sure I believe any of the measurements at
this point. Both are the exact same uops though. Hard to be confident of
anything past that.

If anyone wants to collect very detailed (Agner-level) timings with the
result of this patch, or with the i32 case replaced with SHL + ADD + SHL
+ ADD + SRL, I'd be very interested. Note that you'll need to test it on
both Ivybridge and Haswell, with each of SSE3, SSSE3, and AVX selected, as
I saw unique behavior in each of these buckets with IACA, all of which
should be checked against measured performance.

But this patch is still a useful improvement by dropping duplicate work
and getting the much nicer PSADBW lowering for v2i64.

I'd still like to rephrase this in terms of generic horizontal sum. It's
a bit lame to have a special case of that just for popcount.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238652 91177308-0d34-0410-b5e6-96231b3b80d8

2015-05-30 10:35:03 +00:00
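
The horizontal-add patterns named in the commit message above can be illustrated in scalar C. This is only a sketch of the arithmetic on a single lane, not the patch's actual lowering code (which is not shown on this page). It assumes the input to the horizontal add is a lane whose bytes already hold per-byte popcounts, and it models PSADBW as a sum of absolute differences against zero, which reduces to summing the eight bytes of a 64-bit lane. All function names here are hypothetical.

```c
#include <stdint.h>

/* Per-byte popcount of a 64-bit word using the classic bit-slicing steps.
   Each result byte holds the popcount (0..8) of the matching input byte.
   This stands in for whatever per-byte popcount the lowering produces. */
static inline uint64_t bytewise_popcount(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return x;
}

/* SHL + ADD + SRL on one 16-bit lane holding two byte counts (hi:lo).
   SHL 8 moves lo into the high byte, ADD leaves hi+lo in the high byte,
   and SRL 8 moves that sum down to the low byte. */
static inline uint16_t hsum_i16_lane(uint16_t byte_counts) {
    uint16_t shl = (uint16_t)(byte_counts << 8);
    uint16_t add = (uint16_t)(shl + byte_counts);
    return (uint16_t)(add >> 8);
}

/* Scalar model of PSADBW against zero on one 64-bit lane:
   |b - 0| summed over the lane's eight bytes is just the byte sum. */
static inline uint64_t hsum_i64_lane(uint64_t byte_counts) {
    uint64_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += (byte_counts >> (8 * i)) & 0xFFu;
    return sum;
}

/* Hypothetical wrappers combining the per-byte count with each horizontal add. */
static inline uint16_t popcount16_sketch(uint16_t x) {
    return hsum_i16_lane((uint16_t)bytewise_popcount(x));
}

static inline uint64_t popcount64_sketch(uint64_t x) {
    return hsum_i64_lane(bytewise_popcount(x));
}
```

In the 16-bit case no carry can leave the high byte, because two byte counts sum to at most 16. The 64-bit case shows why a single PSADBW is attractive for v2i64: one instruction per lane replaces a chain of shifts and adds.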

Analysis

[DependenceAnalysis] Extend unifySubscriptType for handling coupled subscript groups.

2015-05-29 16:58:08 +00:00

Assembler

IR / debug info: Add a DWOId field to DICompileUnit,

2015-05-21 20:37:30 +00:00

Bindings

IR: Give 'DI' prefix to debug info metadata

2015-04-29 16:38:44 +00:00

Bitcode

[BitcodeReader] Change an assert to a call to Error()