RPCS3/llvm: Old fork of llvm-mirror, used on older RPCS3 builds - llvm

mirror of https://github.com/RPCS3/llvm.git synced 2025-02-15 00:16:42 +00:00

Chandler Carruth ee26c4120d [x86] Teach the cmov converter to aggressively convert cmovs with memory operands into control flow.

We have periodically seen performance problems with cmov where one
operand comes from memory. On modern x86 processors with strong branch
predictors and speculative execution, this tends to be much better done
with a branch than cmov. We routinely see cmov stalling while the load
is completed rather than continuing, and if there are subsequent
branches, they cannot be speculated in turn.

Also, in many (even simple) cases, macro fusion lets the control-flow
version use fewer uops.
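
As a source-level illustration of the two shapes being compared, here is a minimal C++ sketch. It is not code from this change: the `Record` layout and the function names are hypothetical, chosen only to resemble the `[rsi]` and `[rsi+0xf]` accesses in the listings that follow, and a compiler is free to lower either function to either form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout chosen to mirror the listings: a 64-bit payload at
// offset 0 and a one-byte tag at offset 0xf.
struct Record {
    std::uint64_t payload;
    std::uint8_t  pad[7];
    std::uint8_t  tag;
};
static_assert(offsetof(Record, tag) == 0xf, "tag is expected at offset 0xf");

// Select form: the payload is loaded unconditionally and then chosen by the
// comparison. A compiler may lower this to a load followed by a cmov.
std::uint64_t select_form(const Record& r) {
    const std::uint64_t loaded = r.payload;
    return r.tag > 0xf ? loaded : 0;
}

// Control-flow form: the payload is loaded only on the taken path, so the
// result does not have to wait on the load when the branch goes the other way.
std::uint64_t branch_form(const Record& r) {
    if (r.tag > 0xf)
        return r.payload;
    return 0;
}

// Both forms must agree for every input; only the generated code may differ.
void check_forms_agree(const Record& r) {
    assert(select_form(r) == branch_form(r));
}
```

The two functions are semantically identical, which is what makes the conversion legal; the choice between them is purely a performance heuristic.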

Consider the IACA output for the initial sequence of code in a very hot
function in one of our internal benchmarks that motivates this, and notice the
micro-op reduction provided.
Before, SNB:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.20 Cycles       Throughput Bottleneck: Port1

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    |           | 1.0 |           |           |     |     | CP | mov rcx, rdi
|   0*   |           |     |           |           |     |     |    | xor edi, edi
|   2^   | 0.1       | 0.6 | 0.5   0.5 | 0.5   0.5 |     | 0.4 | CP | cmp byte ptr [rsi+0xf], 0xf
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |    | mov rax, qword ptr [rsi]
|   3    | 1.8       | 0.6 |           |           |     | 0.6 | CP | cmovbe rax, rdi
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |    | cmp byte ptr [rcx+0xf], 0x10
|   0F   |           |     |           |           |     |     |    | jb 0xf
Total Num Of Uops: 9
```
After, SNB:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles       Throughput Bottleneck: Port5

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    | 0.5       | 0.5 |           |           |     |     |    | mov rax, rdi
|   0*   |           |     |           |           |     |     |    | xor edi, edi
|   2^   | 0.5       | 0.5 | 1.0   1.0 |           |     |     |    | cmp byte ptr [rsi+0xf], 0xf
|   1    | 0.5       | 0.5 |           |           |     |     |    | mov ecx, 0x0
|   1    |           |     |           |           |     | 1.0 | CP | jnbe 0x39
|   2^   |           |     |           | 1.0   1.0 |     | 1.0 | CP | cmp byte ptr [rax+0xf], 0x10
|   0F   |           |     |           |           |     |     |    | jnb 0x3c
Total Num Of Uops: 7
```
On Haswell, the difference even shows up in the block throughput.
Before, HSW:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles       Throughput Bottleneck: FrontEnd

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   0*   |           |     |           |           |     |     |     |     |    | mov rcx, rdi
|   0*   |           |     |           |           |     |     |     |     |    | xor edi, edi
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |     |     |    | cmp byte ptr [rsi+0xf], 0xf
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov rax, qword ptr [rsi]
|   3    | 1.0       | 1.0 |           |           |     |     | 1.0 |     |    | cmovbe rax, rdi
|   2^   | 0.5       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.5 |     |    | cmp byte ptr [rcx+0xf], 0x10
|   0F   |           |     |           |           |     |     |     |     |    | jb 0xf
Total Num Of Uops: 8
```
After, HSW:
```
Throughput Analysis Report
--------------------------
Block Throughput: 1.50 Cycles       Throughput Bottleneck: FrontEnd

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   0*   |           |     |           |           |     |     |     |     |    | mov rax, rdi
|   0*   |           |     |           |           |     |     |     |     |    | xor edi, edi
|   2^   |           |     | 1.0   1.0 |           |     | 1.0 |     |     |    | cmp byte ptr [rsi+0xf], 0xf
|   1    |           | 1.0 |           |           |     |     |     |     |    | mov ecx, 0x0
|   1    |           |     |           |           |     |     | 1.0 |     |    | jnbe 0x39
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     |    | cmp byte ptr [rax+0xf], 0x10
|   0F   |           |     |           |           |     |     |     |     |    | jnb 0x3c
Total Num Of Uops: 6
```
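
To put the four reports side by side, the following C++ sketch computes the relative reductions in uop count and block throughput cycles. The numbers are copied from the IACA totals above; the `Report` struct, the constants, and `percent_reduction` are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>

// Totals copied from the IACA reports above (uops and block throughput cycles).
struct Report {
    double uops_before;
    double uops_after;
    double cycles_before;
    double cycles_after;
};

const Report kSNB{9, 7, 2.20, 2.00};  // Sandy Bridge
const Report kHSW{8, 6, 2.00, 1.50};  // Haswell

// Relative reduction from `before` to `after`, as a percentage of `before`.
double percent_reduction(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

Calling `percent_reduction(kSNB.uops_before, kSNB.uops_after)` gives roughly 22%, and the Haswell uop and cycle reductions both come out to 25%. These are static estimates from IACA for one code sequence, not measured speedups.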

Note that this cannot be usefully restricted to inner loops. Much of the
hot code we see hitting this is not in an inner loop or not in a loop at
all. The optimization still remains effective and indeed critical for
some of our code.

I have run a suite of internal benchmarks with this change. I saw a few
very significant improvements and a handful of minor regressions; for
most benchmarks the change has no significant effect. The improvements,
however, were in quite important routines responsible for a great deal
of our C++ CPU cycles, so the gains pretty clearly outweigh the
regressions for us.

I also ran the test-suite and SPEC2006. Only 11 binaries changed at all
and none of them showed any regressions.

Amjad Aboud at Intel also ran this over their benchmarks and saw no
regressions.

Differential Revision: https://reviews.llvm.org/D36858

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@311226 91177308-0d34-0410-b5e6-96231b3b80d8

2017-08-19 05:01:19 +00:00

bindings

Update the Go bindings for r309426 (remove offset from llvm.dbg.value)

2017-07-28 22:44:44 +00:00

cmake

[CMake][LLVM] Remove duplicated library mask. Broken clang linking against clangShared

2017-08-10 13:37:58 +00:00

docs

[Lexicon] Add "GEP"

2017-08-18 15:35:53 +00:00

examples

[ORC][Kaleidoscope] Update Chapter 1 of BuildingAJIT to incorporate recent ORC

2017-08-15 19:20:10 +00:00

include

llvm-mt: Merge manifest namespaces.

2017-08-19 00:37:41 +00:00

lib

[x86] Teach the cmov converter to aggressively convert cmovs with memory

2017-08-19 05:01:19 +00:00

projects

Add temporary workaround to allow in-tree libc++ builds on Windows

2017-05-11 01:44:30 +00:00

resources

In MSVC builds embed a VERSIONINFO resource in our exe and DLL files.

2015-06-12 15:58:29 +00:00

runtimes

[CMake][runtimes] Support for building target variants

2017-08-16 19:13:45 +00:00

test

[x86] Teach the cmov converter to aggressively convert cmovs with memory

2017-08-19 05:01:19 +00:00

tools

llvm-mt: Merge manifest namespaces.

2017-08-19 00:37:41 +00:00

unittests

[Support] env vars with empty values on windows

2017-08-18 16:55:44 +00:00

utils

[lit] support unsetting env variables (again!)

2017-08-18 17:32:57 +00:00

.arcconfig

project_id is from another era in phabricator land and does not provide any value.

2016-09-27 15:47:29 +00:00

.clang-format

Test commit.

2014-03-02 13:08:46 +00:00

.clang-tidy

.clang-tidy: correct style name is 'camelBack' not 'lowerCase'.

2016-09-13 19:04:26 +00:00

.gitignore

gitignore: Ignore .vs folder (VS2017 config files)

2017-04-08 00:16:58 +00:00

CMakeLists.txt

Remove RISCV from LLVM_ALL_TARGETS in CMakeLists.txt

2017-08-13 18:49:33 +00:00

CODE_OWNERS.TXT

Remove the BBVectorize pass.

2017-06-30 07:09:08 +00:00

configure

Remove autoconf support

2016-01-26 21:29:08 +00:00

CREDITS.TXT

Another test commit

2017-07-01 03:24:06 +00:00

LICENSE.TXT

Bump year to 2017 in LICENSE.txt

2017-01-12 18:02:42 +00:00

llvm.spec.in

[Sparc] Implement i64 load/store support for 32-bit sparc.

2015-08-10 19:11:39 +00:00

LLVMBuild.txt

Remove the very substantial, largely unmaintained legacy PGO

2013-10-02 15:42:23 +00:00

README.txt

Test commit access

2017-08-18 02:39:28 +00:00

RELEASE_TESTERS.TXT

[RelTest] Diana is doing both releases now

2017-07-14 08:33:52 +00:00

README.txt

Low Level Virtual Machine (LLVM)
================================

This directory and its subdirectories contain source code for LLVM,
a toolkit for the construction of highly optimized compilers,
optimizers, and runtime environments.

LLVM is open source software. You may freely distribute it under the terms of
the license agreement found in LICENSE.txt.

Please see the documentation provided in docs/ for further
assistance with LLVM, and in particular docs/GettingStarted.rst for getting
started with LLVM and docs/README.txt for an overview of LLVM's
documentation setup.

If you are writing a package for LLVM, see docs/Packaging.rst for our
suggestions.

Languages

LLVM 52.9%

C++ 32.7%

Assembly 13.2%

Python 0.4%

C 0.4%

Other 0.3%