This was a debug LoadConstant that would load the block's entry RIP into a temporary
register to make it easier to see which RIP a block was in.
It was added when FEX stopped storing the RIP in the CPU state
for every block. It is no longer necessary since FEX stores the RIP
in the tail data of the block.
This was affecting InstructionCountCI in debug builds.
`eor <reg>, <reg>, <reg>` is not the optimal way to zero a vector
register on ARM CPUs. Instead we should use a move from a constant or from the zero
register to take advantage of zero-latency moves.
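As a sketch of the difference (illustrative registers, not necessarily the exact sequence FEX emits):

```
// what we were emitting: a dependent ALU op just to zero the register
eor  v0.16b, v0.16b, v0.16b

// preferred: an immediate move (or a move of a known-zero source),
// which modern ARM cores can handle as a zero-latency zeroing idiom
movi v0.2d, #0
```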
This needs to default to 64-bit addresses; it was previously
defaulting to 32-bit, which meant the destination address was
getting truncated. In a 32-bit process the address is still 32-bit.
I'm actually surprised this hasn't caused spurious SIGSEGV before this
point.
Adds a 32-bit test to ensure that side is tested as well.
This is more obvious. llvm-mca says TST is half the cycle count of CMN
for whatever CPU model it defaults to. dougallj's reference shows both as the
same performance.
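As a sketch of the swap (the real operands depend on the surrounding code; comparing against zero is an assumption here):

```
// before: derive N/Z by adding zero
cmn  x0, #0

// after: a logical test of the value against itself sets the same N/Z
// bits and reads more obviously as "is this zero / negative?"
tst  x0, x0
```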
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
In the non-immediate cases, we can amortize some work between the two
flags to come out 1 instruction ahead.
In the immediate case, this costs us an extra 2 instructions compared to
before we packed NZCV flags, but it mitigates a bigger instruction count
regression that this PR would otherwise have. Coming out ahead will
require FlagM and smarter RA, but is doable.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Same technique as the left shifts. Gets rid of all our COND_FLAG_SET
use, which is good because it's a performance footgun.
Overall saves 17 instructions (!!!!) from the flag calculation code for
`sar eax, cl`.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Similar to the immediate case, but now we select between the entire old
and new NZCV values. This is faster than selecting each bit
independently. Saves 11 instructions for calculating flags for `shl eax,
cl`.
It is undefined in this case. We prefer to zero (rather than preserve
the existing value) as it avoids a costly RMW of the NZCV register.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
We need to be careful to preserve V if needed. For `shl 1` and `shr 1`,
saves 2 instructions overall compared to before the PR. For `sar 1`,
saves 3 overall.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
We can just return zero; there is no need to do a pointless Bfe. Saves yet
another instruction for GetPackedRFLAG in a test I'm looking at.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
We know that bit 0 is CF, so we can do CF first and then avoid setting
Original to 2 (for the reserved bit) with a silly `orr <reg>, xzr, #2` instruction.
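Roughly, with illustrative registers (this is only a sketch, assuming `w1` holds CF as a 0/1 value; exactly where the reserved bit gets folded back in depends on the surrounding inserts):

```
// before: materialize the reserved bit first, then RMW-insert CF at bit 0
orr  w0, wzr, #2
bfi  w0, w1, #0, #1     // w1 holds CF

// after: start from CF and fold the reserved bit in without a standalone
// constant move
orr  w0, w1, #2
```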
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Faster sign/negate testing for 32-bit/64-bit inputs. This could maybe be
extended to 8/16-bit if we have FlagM but that's for later.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
If we can prove that a flag bit could not possibly be set, we can use
orlshl rather than bfi, which can be more efficient.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
In some cases we just want to insert one bit at a time; add a helper
to zero the 4 flags together so we can avoid the extra RMW cycle at the
beginning.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
We can set N more efficiently with some bit math, and zero ZCV at the
same time. In the future we'll be able to use TST for this to make it
even faster.
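As a sketch of the bit math for a 32-bit result (assuming N sits in bit 31, matching the hardware NZCV layout being modelled):

```
// w0 holds the 32-bit result; its sign bit is already in the N position,
// so a single AND extracts N and zeroes Z/C/V in one go
and  w1, w0, #0x80000000
```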
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Later, this will let us take advantage of the arm64 flags. For
now, this just turns some strb's into bfi's for dubious benefit.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
If we read or write NZCV flags we need to call CalculateDeferredFlags on
block boundaries, if only to flush out the cached copy.
Also, when leaving a block we call it to flush the cached flags out. This is annoyingly
invasive but I don't know of a better way to do this that doesn't
involve rearchitecting the dispatcher.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
We'll add an extra caching layer in a moment, so we can't call _LoadFlag
directly and expect correct results.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Maps to arm64 tst, except properly SSA. This will need some RA support
to avoid redundant mrs/msr sequences.
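On the arm64 backend this lowers to roughly the following (a sketch; the RA support mentioned would let back-to-back flag producers and consumers stay in NZCV and skip the round-trip):

```
tst  w0, w1          // ANDS with the result discarded; sets N/Z in NZCV
mrs  x2, nzcv        // read the flags back out as the SSA result
// a later consumer that needs them in hardware again would do:
// msr  nzcv, x2
```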
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
On arm64, orr (with a shifted register) is maybe fewer cycles and
definitely easier on the RA than bfi (=> fewer moves generated). So,
it's preferred when we know the corresponding bit is 0 in the
destination.
It's not useful on other targets, so it's gated behind a backend feature
bit.
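For instance, with illustrative operands (this assumes the inserted value is a single 0/1 bit, so the shifted OR cannot spill into neighbouring bits):

```
// bfi is a read-modify-write of its destination, so keeping the old
// value live can force the RA to emit an extra move
mov  w2, w0
bfi  w2, w1, #8, #1      // insert bit 0 of w1 at bit 8

// orr with a shifted register is a plain three-operand op, usable when
// bit 8 of w0 is already known to be zero
orr  w2, w0, w1, lsl #8
```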
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reserve 4 bytes of "flags" to model the 32-bit arm64 NZCV register, so
we can start porting FEX's flag handling code over to using NZCV without
needing the whole compiler to be aware of instructions that might
clobber host flags.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This IR operation was previously limited to GPRs for the values
getting compared.
This adds support for GPRPair (and technically FPR) so that it can be
used directly with GPR pairs. I say technically FPR because the
IREmitter disallowed FPRs for the comparison, but this was already
supported in all of the backends; we just never used it.
Some minor changes to the constant prop pass ensure that we don't try
to propagate a select into a CondJump, since otherwise pair comparisons would
be duplicated. That code expects to be able to merge a simple
comparison into a `cbnz`, which doesn't happen with GPR pairs.
When cmake's `install` function is invoked with a relative path, it
is interpreted as relative to the `CMAKE_INSTALL_PREFIX` variable.
A relative destination then honors both `DESTDIR` and `CMAKE_INSTALL_PREFIX`, so it is
best to use relative paths in the install destination.
Thanks for the report, Mike!
Fixes #2849