llvm-mirror

mirror of https://github.com/RPCS3/llvm-mirror.git synced 2025-05-16 12:05:58 +00:00

Author	SHA1	Message	Date
David Sherwood	2d2e4a1b17	[Analysis] Add simple cost model for strict (in-order) reductions I have added a new FastMathFlags parameter to getArithmeticReductionCost to indicate what type of reduction we are performing: 1. Tree-wise. This is the typical fast-math reduction that involves continually splitting a vector up into halves and adding each half together until we get a scalar result. This is the default behaviour for integers, whereas for floating point we only do this if reassociation is allowed. 2. Ordered. This now allows us to estimate the cost of performing a strict vector reduction by treating it as a series of scalar operations in lane order. This is the case when FP reassociation is not permitted. For scalable vectors this is more difficult because at compile time we do not know how many lanes there are, and so we use the worst case maximum vscale value. I have also fixed getTypeBasedIntrinsicInstrCost to pass in the FastMathFlags, which meant fixing up some X86 tests where we always assumed the vector.reduce.fadd/mul intrinsics were 'fast'. New tests have been added here: Analysis/CostModel/AArch64/reduce-fadd.ll Analysis/CostModel/AArch64/sve-intrinsics.ll Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll Differential Revision: https://reviews.llvm.org/D105432	2021-07-26 10:26:06 +01:00
Simon Pilgrim	c4477aac03	[CostModel][X86] Adjust shift SSE4 legalized costs based on llvm-mca reports. Update shl/lshr/ashr costs based on the worst case costs from the script in D103695 - many of the 128-bit shifts (usually where integer multiplies aren't used) have similar behaviour to AVX1 so we can merge them.	2021-07-22 20:07:32 +01:00
Simon Pilgrim	8371f55768	[CostModel][X86] Adjust shift SSE legalized costs based on llvm-mca reports. Update shl/lshr/ashr costs based on the worst case costs from the script in D103695.	2021-07-22 18:12:49 +01:00
Simon Pilgrim	32c30ddee7	[X86] Implement smarter instruction lowering for FP_TO_UINT from f32/f64 to i32/i64 and vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction. We know that "CVTTPS2SI" returns 0x80000000 for out of range inputs (and for FP_TO_UINT, negative float values are undefined). We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without blend using the following logic: small := CVTTPS2SI(x); fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31)) Even on targets where "PBLENDVPS"/"PBLENDVB" exists, it is often a latency 2, low throughput instruction so this logic is applied there too (in particular for AVX2 also). It furthermore gets rid of one high latency floating point comparison in the previous lowering. @TomHender checked the correctness of this for all possible floats between -1 and 2^32 (both ends excluded). Original Patch by @TomHender (Tom Hender) Differential Revision: https://reviews.llvm.org/D89697	2021-07-14 12:03:49 +01:00
Simon Pilgrim	436c202090	[CostModel][X86] Adjust fptosi/fptoui SSE/AVX legalized costs based on llvm-mca reports. Update (mainly) vXf32/vXf64 -> vXi8/vXi16 fptosi/fptoui costs based on the worst case costs from the script in D103695. Move to using legalized types wherever possible, which allows us to prune the cost tables.	2021-07-12 20:38:25 +01:00
Simon Pilgrim	063de0d715	[CostModel][X86] Adjust truncate SSE/AVX legalized costs based on llvm-mca reports. Update truncation costs based on the worst case costs from the script in D103695. Move to using legalized types wherever possible, which allows us to prune the cost tables.	2021-07-12 13:50:43 +01:00
David Green	fd61052e59	[TTI] Remove IsPairwiseForm from getArithmeticReductionCost This patch removes the IsPairwiseForm flag from the Reduction Cost TTI hooks, along with some accompanying code for pattern matching reductions from trees starting at extract elements. IsPairWise is now assumed to be false, which was the predominant way that the value was used from both the Loop and SLP vectorizers. Since the adjustments such as D93860, the SLP vectorizer has not relied upon this distinction between pairwise and non-pairwise reductions. This also removes some code that was detecting reduction trees starting from extract elements inside the costmodel. This case was double-counting costs though, adding the individual costs on the individual instruction _and_ the total cost of the reduction. Removing it changes the costs in llvm/test/Analysis/CostModel/X86/reduction.ll to not double count. The cost of reduction intrinsics is still tested through the various tests in llvm/test/Analysis/CostModel/X86/reduce-xyz.ll. Differential Revision: https://reviews.llvm.org/D105484	2021-07-09 11:51:16 +01:00
Alexey Bataev	9770438876	[SLP][COST][X86]Improve cost model for masked gather. Revived D101297 in its original form + added some changes in X86 legalization checking for masked gathers. This solution is the most stable and the most correct one. We have to check the legality before trying to build the masked gather in SLP. Without this check we have incorrect cost (for SLP) in case if the masked gather is not legal/slower than the gather. And we're missing some vectorization opportunities. This can be fixed in the cost model, but in this case we need to add special checks for the cost of GEPs for ScatterVectorize node, add special check for small trees, etc., i.e. there are a lot of corner cases here and there, which increase code base and make it harder to maintain the code. > Can't we rely on cost model to deal with this? This can be profitable for further vectorization, when we can start from such gather loads as seed. The question from D101297. Actually, no, it can't. Actually, simple gather may give us better result, especially after we started vectorization of insertelements. Plus, like I said before, the cost for non-legal masked gathers leads to missed vectorization opportunities. Differential Revision: https://reviews.llvm.org/D105042	2021-07-08 11:53:30 -07:00
Simon Pilgrim	5d69b0490b	[CostModel][X86] Account for older SSE targets with slow fp->int conversions Both the conversion cost and the xmm->gpr transfer cost tend to be a lot higher on early SSE targets	2021-07-08 18:08:24 +01:00
Simon Pilgrim	8dde5bc762	[CostModel][X86] Adjust sext/zext SSE/AVX legalized costs based on llvm-mca reports. Update costs based on the worst case costs from the script in D103695. Move to using legalized types wherever possible, which allows us to prune the cost tables.	2021-07-07 13:58:27 +01:00
Simon Pilgrim	929bb9374e	[CostModel][X86] Adjust sitofp/uitofp SSE/AVX legalized costs based on llvm-mca reports. Update (mainly) vXi8/vXi16 -> vXf32/vXf64 sitofp/uitofp costs based on the worst case costs from the script in D103695. Move to using legalized types wherever possible, which allows us to prune the cost tables.	2021-07-07 12:03:45 +01:00
Simon Pilgrim	b8bec02bc7	[CostModel][X86] fptosi/fptoui to i8/i16 are truncated from fptosi to i32 Provide a generic fallback that performs the fptosi to i32 types, then truncates to sub-i32 scalars. These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.	2021-07-06 17:28:03 +01:00
Simon Pilgrim	1787dc0460	[CostModel][X86] i8/i16 sitofp/uitofp are sext/zext to i32 for sitofp Provide a generic fallback that extends sub-i32 scalars before using the existing sitofp instructions. These numbers can be tweaked for specific sse levels, but we should get the default handling in place first. We get the extension for free for non-vector loads.	2021-07-06 13:58:52 +01:00
Simon Pilgrim	8ed3c99d18	[CostModel][X86] Handle costs for insert/extractelement with non-immediate indices via stack Determine the insert/extractelement costs when performing this as a sequence of aliased loads+stores via the stack.	2021-07-05 13:26:53 +01:00
Simon Pilgrim	4d75d8ef44	[CostModel][X86] Adjust i32/i64 to f32/f64 scalar based on llvm-mca reports (+ Agner). Older SSE targets have slower gpr->fpu scalar conversions - we also need to account for uitofp i32 > f32/f64 being lowered as sitofp i64 -> f32/f64	2021-07-05 13:26:53 +01:00
Simon Pilgrim	4d7e8eba5b	[CostModel][X86] Update comment describing source of costs - we now use llvm-mca more than IACA	2021-07-02 14:29:32 +01:00
Simon Pilgrim	5e6cee0948	[CostModel][X86] Drop some hard coded fp<->int scalarization costs Scalarization costs handling is a lot better now, and the hard coded costs were higher than the worst case numbers from the script in D103695	2021-07-02 14:29:32 +01:00
Simon Pilgrim	1a043ee5ed	[CostModel][X86] Find AVX conversion costs using legalized types if custom types didn't match Building on rG2a1ef8784ad9a, fallback to attempting to match against legalized types like we do for SSE targets.	2021-07-02 13:49:31 +01:00
Simon Pilgrim	c54e15d18e	[CostModel][X86] Adjust uitofp(vXi64) SSE/AVX legalized costs based on llvm-mca reports. Update v4i64 -> v4f32/v4f64 uitofp costs based on the worst case costs from the script in D103695. Fixes a few regressions before we start adding AVX costs for legalized types.	2021-07-02 13:09:00 +01:00
Simon Pilgrim	f0a2f7f0f9	[CostModel][X86] Adjust fp<->int vXi32 SSE legalized costs based on llvm-mca reports. Building on rG2a1ef8784ad9a, adjust the SSE cost tables to use the legalized types based on the worst case costs from the script in D103695. To account for different numbers of src/dst legalized type registers we must scale the cost by maximum of the src/dst, not just use src	2021-07-01 15:34:20 +01:00
Simon Pilgrim	b53b3c8d1c	[CostModel][X86] getCastInstrCost - attempt to match custom cast/conversion before legalized types. Move the (SSE-only) generic, legalized type conversion matching after the specific,custom conversion cases, allowing us to properly provide cost overrides. The next step will be to clean up some of the weird existing costs and then to enable AVX+ legalized costs, which will let us strip out a lot of the cost tables entries.	2021-07-01 12:06:40 +01:00
Simon Pilgrim	3ab40c9e92	[CostModel][X86] Adjust fp<->int vXi32 AVX1+ costs based on llvm-mca reports Based off the worst case numbers generated by D103695, the AVX1/2/512 sitofp/uitofp/fptosi/fptoui costs were higher than necessary (based off instruction counts instead of actual throughput). The SSE costs still need further fixes, but I hit an issue with the order in which SSE costs are checked - we need to check CUSTOM costs (with non-legal types) first, and then fallback to LEGALIZED types. I'm looking at this now, and this should let us start thinning out a lot of the duplicates in the costs tables. Then we can finally start work on vXi64 / vXi16 / vXi8 / vXi1 integers, which should let us look at sub-128-bit vectorization (D103925).	2021-06-30 15:23:34 +01:00
Simon Pilgrim	b21f333bb4	[CostModel][X86] Improve AVX1/AVX2 truncation costs Based off the worst case numbers generated by D103695, we were overestimating the cost of a number of vector truncations: AVX2: v2i32->v2i8, v2i64->v2i16 + v4i64->v4i32 AVX1: v2i32->v2i8, v4i64->v4i16 + v16i16->v16i8 Once we have a working set of conversion costs, the intention is to cleanup the tables and use legalized types a lot more to reduce the number of entries we currently have.	2021-06-08 10:41:03 +01:00
Simon Pilgrim	779c968967	[CostModel][X86] Add 512-bit bswap costs	2021-06-06 22:36:34 +01:00
Simon Pilgrim	3d02a9f278	[CostModel][X86] Improve AVX512 FDIV costs Add missing v16f32/v8f64 costs and adjust other costs as well based off the SkylakeServer model	2021-06-06 21:41:05 +01:00
Simon Pilgrim	0c78aae183	[CostModel][X86] Improve accuracy of sext/zext to 256-bit vector costs on AVX1 targets Determined from llvm-mca analysis (btver2 vs bdver2 vs sandybridge), the split+extends+concat sequence on AVX1 capable targets are cheaper than the #ops that the cost was previously based on.	2021-05-27 18:17:50 +01:00
Simon Pilgrim	154058d3b3	[CostModel][X86] AVX512 truncation ops are slower than cost models indicate. The SkylakeServer model (and later IceLake/TigerLake targets according to Agner) have the PMOV truncations as uops=2, rthroughput=2 instructions. Noticed while trying to reduce the diffs between cost tables and llvm-mca analysis.	2021-05-27 16:07:42 +01:00
Roman Lebedev	395eb3f2e5	[NFC][X86] clang-format X86TTIImpl::getInterleavedMemoryOpCostAVX2() I plan to make changes to it, and undoing formatting each time is not going to be fun.	2021-05-26 12:27:47 +03:00
Simon Pilgrim	c8ca7d77d1	[CostModel][X86] Improve accuracy of 256-bit non-uniform vector shifts on AVX1 Determined from llvm-mca analysis, AVX1 capable targets have a higher throughput for VPBLENDVB and shuffle ops, making it cheaper to perform shift+shuffle/select shift patterns.	2021-05-25 17:31:45 +01:00
Simon Pilgrim	414d8672a3	[CostModel][X86] Improve accuracy of vXi64 vector non-uniform shift costs on AVX2+ targets rG1ad4f887bd7692a9e63fb42586f0ece366f2fe01 incorrectly assumed that vXi64 non-uniform shifts were slow like vXi32 were - but llvm-mca (+Agner) both confirm that Haswell/Broadwell are full rate.	2021-05-25 15:58:23 +01:00
Simon Pilgrim	491a1dbd4b	[CostModel][X86] Improve accuracy of vXi8/vXi16 vector non-uniform shift costs on AVX2/AVX512 targets Determined from llvm-mca analysis, AVX2+ capable targets have a higher throughput for VPBLENDVB and VPMOVZX ops, making it cheaper to perform shift+select patterns for vXi8 shifts or extend/shift/truncate for vXi16 shifts. Similarly AVX512BW can perform vXi8 as extend/shift/truncate patterns.	2021-05-25 11:35:57 +01:00
Roman Lebedev	7cf7fe8908	[X86][Costmodel] getMaskedMemoryOpCost(): don't scalarize non-power-of-two vectors with legal element type This follows in steps of similar `getMemoryOpCost()` changes, D100099/D100684. Intel SDM, `VPMASKMOV — Conditional SIMD Integer Packed Loads and Stores`: ``` Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no faults will be detected if the mask bits are all zero. ``` I.e., if mask is all-zeros, any address is fine. Masked load/store's prime use-case is e.g. tail masking the loop remainder, where for the last iteration, only first some few elements of a vector exist. So much similarly, i don't see why must we scalarize non-power-of-two vectors, iff the element type is something we can masked- store/load. We simply need to legalize it, widen the mask, and be done with it. And we even already count the cost of widening the mask. Reviewed By: ABataev Differential Revision: https://reviews.llvm.org/D102990	2021-05-24 20:09:54 +03:00
Simon Pilgrim	f196122112	[CostModel][X86] Add missing SSE41 v2iX sext/zext costs Also fix existing v4i8->v4i16 sext cost to match the equivalents	2021-05-24 15:53:43 +01:00
Simon Pilgrim	6129236a8f	[CostModel][X86] Improve accuracy of vector non-uniform shift costs on XOP/AVX2 targets By llvm-mca analysis, Haswell/Broadwell has a non-uniform vector shift recip-throughput cost of the AVX2 targets at 2 for both 128 and 256-bit vectors - XOP capable targets have better 128-bit vector shifts so improve the fallback in those cases.	2021-05-24 14:18:21 +01:00
Simon Pilgrim	4fa97550f8	[CostModel][X86] Improve accuracy of vXi64 MUL costs on AVX2/AVX512 targets By llvm-mca analysis, Haswell/Broadwell has the worst v4i64 recip-throughput cost of the AVX2 targets at 6 (vs the currently used cost of 8). Similarly SkylakeServer (our only AVX512 target model) implements PMULLQ with an average cost of 1.5 (rounded up to 2.0), and the PMULUDQ-sequence (without AVX512DQ) as a cost of 6.	2021-05-24 09:48:32 +01:00
Simon Pilgrim	f8e2239648	[CostModel][X86] Align v2i64 MUL costs on SSE42+ targets with worst case Based on worst case of sandybridge (which seems to match nehalem for this SSE sequence) (vs btver2 + bdver2) llvm-mca analysis	2021-05-23 16:20:57 +01:00
Simon Pilgrim	f6dc684c0c	[CostModel][X86] Align v4i64 MUL costs on AVX1 targets with worst case Based on worst case of sandybridge (vs btver2 + bdver2) llvm-mca analysis - which is a lot less than what we were predicting (I think based off total uop count).	2021-05-22 20:07:55 +01:00
Simon Pilgrim	0bdfd66523	[CostModel][X86] Pull out X86/X64 scalar int arithmetic costs from SSE tables. NFCI. These aren't dependent on any SSE level (and don't tend to get quicker either).	2021-05-22 16:13:49 +01:00
Simon Pilgrim	99610346a2	[CostModel][X86] vXi8 MUL is always promoted to vXi16	2021-05-22 11:56:49 +01:00
Simon Pilgrim	1482151467	[CostModel][X86] Improve v8i32 MUL costs on AVX1 targets to account for slower btver2 BTVER2 has a 2 cycle throughput for v4i32 multiplies (same as SSE41 targets), which is only partially hidden by the subvector extracts/insert when splitting v8i32.	2021-05-22 11:13:07 +01:00
Roman Lebedev	325fca859f	Reland [X86] X86TTIImpl::getInterleavedMemoryOpCostAVX2(): use getMemoryOpCost() Now that getMemoryOpCost() correctly handles all the vector variants, we should no longer hand-roll our own version of it, but use it directly. The AVX512 variant probably needs a similar change, but there it is less obvious. This was initially landed in 69ed93a4355123a45c1d7216aea7cd53d07a361b, but was reverted in 6b95fd199d96e3ba5c28a23b17b74203522bdaa8 because the patch it depends on was reverted.	2021-05-22 11:47:08 +03:00
Roman Lebedev	3d2511b099	Reland [X86][CostModel] X86TTIImpl::getMemoryOpCost(): rewrite vector handling again Instead of handling power-of-two sized vector chunks, try handling the large vector in a stream mode, decreasing the operational vector size once it no longer works for the elements left to process. Notably, this improves costs for overaligned loads - loading padding is fine. This more directly tracks when we need to insert/extract the YMM/XMM subvector, some costs fluctuate because of that. This was initially landed in c02476f3158f2908ef0a6f628210b5380bd33695, but reverted in 5fddc3312bad7e62493f1605385fad5e589e6450, because the code made some very optimistic assumptions about invariants that didn't hold in practice. Reviewed By: RKSimon, ABataev Differential Revision: https://reviews.llvm.org/D100684	2021-05-22 11:46:32 +03:00
Simon Pilgrim	b944c68941	[CostModel][X86] Improve f64/v2f64/v4f64 FMUL costs on AVX1 targets to account for slower btver2 BTVER2 has a weaker f64 multiplier than other AVX1-era targets, so we need to bump the worst case cost slightly - llvm-mca reports the new vectorization in simplebb is beneficial on btver2, bdver2 and sandybridge AVX1 targets	2021-05-21 18:12:13 +01:00
Simon Pilgrim	271ac585d4	[CostModel][X86] Improve fneg costs These are always lowered as xor ops, so are always cheap	2021-05-21 17:23:45 +01:00
Simon Pilgrim	02b1451e34	[CostModel][X86] Tweak fptoui v4f32->v4i32 + v8f32->v8i32 SSE/AVX costs Adjust for worst case for atom/slm (SSE), btver2/sandybridge (AVX1) and haswell/znver* (AVX2)	2021-05-21 12:09:31 +01:00
Simon Pilgrim	e2bdc7fda9	[CostModel][X86] Match SSE41 legalized conversion costs as well as SSE2	2021-05-21 11:42:22 +01:00
Simon Pilgrim	61e6792674	[CostModel][X86] Add uitpfp v4f32->v4i32 + v8f32->v8i32 SSE/AVX costs These were using (default) scalarized values.	2021-05-21 11:30:15 +01:00
Simon Pilgrim	145caddc0a	[CostModel][X86][AVX2] Improve 256-bit vector non-uniform shifts costs Haswell, Excavator and early Ryzen all have slower 256-bit non-uniform vector shifts (confirmed on AMDSoG/Agner/instlatx64 and llvm models) - so bump the worst case costs accordingly. Noticed while investigating PR50364	2021-05-20 12:16:16 +01:00
Simon Pilgrim	ef6c8d6574	[X86][AVX] Cleanup AVX2 vector integer truncation costs Noticed while investigating PR50364, the truncation costs for v4i64->v4i16/v4i8 and v8i32->v8i8 were way too optimistic for a shuffle sequence that usually matches the AVX1 codegen (they matched AVX512 numbers which have actual truncation instructions!).	2021-05-18 13:07:29 +01:00
Roman Lebedev	7a524b3d94	Revert "[X86][CostModel] X86TTIImpl::getMemoryOpCost(): rewrite vector handling again" As reported in post-commit feedback, this has issues with e.g. <16 x i1>: https://llvm.godbolt.org/z/jxPvdGEW4 This reverts commit c02476f3158f2908ef0a6f628210b5380bd33695.	2021-05-14 00:03:36 +03:00
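
The strict-reduction commit above (2d2e4a1b17) contrasts a tree-wise reduction with an ordered one. The following is only a toy sketch of that distinction, not the formula LLVM actually uses: the function names, the unit costs, and the `max_vscale` parameter are all assumptions made for illustration, and the real hook may account for further overheads such as lane extraction.

```python
def tree_reduction_cost(num_lanes, vector_op_cost=1, shuffle_cost=1):
    """Toy cost of a tree-wise reduction over a power-of-two lane count.

    Each step splits the vector in half (modelled as one shuffle) and
    combines the halves (one vector op), so there are log2(N) steps.
    """
    if num_lanes < 1 or num_lanes & (num_lanes - 1):
        raise ValueError("num_lanes must be a power of two")
    steps = num_lanes.bit_length() - 1  # exact log2 for powers of two
    return steps * (shuffle_cost + vector_op_cost)


def ordered_reduction_cost(min_lanes, scalar_op_cost=1, max_vscale=1):
    """Toy cost of a strict (in-order) reduction.

    Modelled as one scalar operation per lane, in lane order. For a
    scalable vector the lane count is unknown at compile time, so the
    worst case min_lanes * max_vscale is assumed (max_vscale=1 means a
    fixed-width vector).
    """
    return min_lanes * max_vscale * scalar_op_cost


if __name__ == "__main__":
    print(tree_reduction_cost(8))                    # fixed <8 x float>, reassociation allowed
    print(ordered_reduction_cost(8))                 # same vector, strict ordering
    print(ordered_reduction_cost(4, max_vscale=16))  # scalable vector, worst case
```

Under these assumed unit costs the ordered form grows linearly with the lane count while the tree form grows logarithmically, which is why the distinction matters to the vectorizers.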

519 Commits
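
The FP_TO_UINT commit (32c30ddee7) gives its lowering as `small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))`. The sketch below emulates that logic for a single lane in Python so the bit manipulation can be followed step by step. It is an illustration under stated assumptions, not the compiler's code: `cvttps2si_lane` and `fp_to_ui` are hypothetical helper names, Python floats (doubles) stand in for f32, and inputs are assumed to be exactly representable as f32.

```python
MASK32 = 0xFFFFFFFF
INT_INDEFINITE = 0x80000000  # value the commit says is returned for out-of-range inputs


def cvttps2si_lane(x):
    """Emulate one lane of a truncating float -> signed i32 conversion.

    Returns the 32-bit two's-complement bit pattern, or 0x80000000 when
    the input is NaN or outside the signed 32-bit range.
    """
    if x != x or x >= 2.0**31 or x < -(2.0**31):
        return INT_INDEFINITE
    return int(x) & MASK32  # int() truncates toward zero


def fp_to_ui(x):
    """Unsigned conversion built from two signed conversions.

    Intended for inputs in the open interval (-1, 2^32), the range the
    commit message reports as checked.
    """
    small = cvttps2si_lane(x)
    big = cvttps2si_lane(x - 2.0**31)
    # Arithmetic right shift by 31 replicates the sign bit across the lane.
    sign_mask = MASK32 if small & 0x80000000 else 0
    return small | (big & sign_mask)


if __name__ == "__main__":
    for value in (0.0, 3.75, 2.0**31, 2.0**32 - 256.0):
        print(value, fp_to_ui(value))
```

When `x` is below 2^31 the sign bit of `small` is clear, so the mask is zero and `small` is returned unchanged. Otherwise `small` is 0x80000000, the mask is all ones, and OR-ing in the conversion of `x - 2^31` restores the low 31 bits.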