Do it explicitly for sve-256 and punt on optimizing, so we avoid regressing
codegen otherwise.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Rather than in the context. Effectively a static register allocation scheme for
flags. This will let us optimize out a LOT of flag handling code, keeping things
in NZCV rather than needing to copy between NZCV and memory all the time.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Some opcodes only clobber NZCV under certain circumstances, and we don't yet
have a good way of encoding that. In the meantime this hotfixes some would-be
instcountci regressions.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Again we need to handle this one specially because the dispatcher can't insert
restore code after the branch. It should be optimized in the near future, don't
worry.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Semantics differ markedly from the non-NZCV flags, so splitting this out makes
it a lot easier to do things correctly imho. It gets the dest/src size correct
(important for spilling), and it makes our existing opt passes skip this, which
is needed for correctness at the moment anyway.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This fixes an issue where CPU tunables were ending up in the thunk
generator, which meant that if your CPU didn't support all the features of
the *Builder* then it would crash with SIGILL. This was happening with
Canonical's runners because they typically only support ARMv8.2, but we
are compiling packages to run on ARMv8.4 devices.
cc: FEX-2311.1
SHA instructions are very large right now and cause register spilling
due to their codegen. Ender Lilies has a really large block in a
function called `sha1_block_data_order` that was causing FEX to spill
NZCV flags incorrectly. The assumption, which held true before NZCV
optimizations were a thing, was that all flags were either 1-bit in an
8-bit container, or just 8-bit (the x87 TOP flag).
NZCV host flags broke this assumption by making those flags 32-bit, which
ended up breaking when spilling situations were encountered.
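To illustrate the shape of the fix (names here are invented for illustration,
not FEX's actual API): spill/fill code has to honor each flag's container size
instead of hardcoding a byte.
```cpp
#include <cstdint>

// Illustrative only: before NZCV host flags, every flag was either a
// 1-bit value in an 8-bit container or a plain 8-bit value (x87 TOP),
// so spill code could assume byte-sized stores. The 32-bit NZCV
// container breaks that assumption, so the size must be queried.
enum class Flag { CF, PF, AF, ZF, SF, OF, X87_TOP, HOST_NZCV };

constexpr uint32_t FlagSpillSize(Flag F) {
  return F == Flag::HOST_NZCV ? 4  // 32-bit NZCV container
                              : 1; // 8-bit container for everything else
}
```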
Replace every instance of the Op overwrite pattern, and ban that anti-pattern
from the codebase in the future. This will prevent piles of NZCV-related
regressions.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
The "create op with wrong opcode, then change the opcode" pattern is REALLY
dangerous. This does not address that. But when we start doing NZCV trickery, it
will get /more/ dangerous, and so it's time to add a helper and make the
convenient thing the safe(r) thing. This helper correctly saves NZCV /before/
the instruction like the real builders would. It also provides a spot for future
safety asserts if someone is motivated.
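A toy model of the helper's shape (class and member names invented for
illustration; the real code lives in FEX's IR emitter):
```cpp
#include <cstdint>
#include <vector>

// Toy model, not FEX's real IR. The point: the helper performs the same
// pre-instruction bookkeeping (saving NZCV) that a real generated
// builder would, *then* stamps the requested opcode in one audited spot.
enum class IROps : uint16_t { OP_ADD, OP_SUB, OP_PLACEHOLDER };

struct IROp_Header { IROps Op; };

struct Builder {
  std::vector<IROp_Header*> Code;

  void SaveNZCV() { /* emit a store of host NZCV to its backing slot */ }

  IROp_Header* EmitRaw(IROps Op) {
    Code.push_back(new IROp_Header{Op});
    return Code.back();
  }

  // The safe(r) convenient thing: NZCV is saved /before/ the
  // instruction, and the opcode overwrite is confined to this helper,
  // which is also the natural home for future safety asserts.
  IROp_Header* InsertOpWithOpcode(IROps Opcode) {
    SaveNZCV();
    auto* Op = EmitRaw(IROps::OP_PLACEHOLDER);
    Op->Op = Opcode;
    return Op;
  }
};
```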
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
It looks like currently FEX has a bug around implicit flag clobbering
with this pull request where an IR operation that implicitly clobbers
flags isn't correctly saving the NZCV flags before doing the operation.
Adds a unit test that specifically captures this issue. RAX will be 1 or
0 depending on if the flags are clobbered incorrectly or not.
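A minimal sketch of the shape of such a test (the real unit test is a FEX ASM
test; `sha1msg1` is just a stand-in here for an instruction whose IR
implementation clobbers NZCV on the host while architecturally leaving EFLAGS
alone, and running it natively needs SHA-extension hardware):
```cpp
#include <cstdint>
#include <cstdio>

int main() {
  uint64_t Result;
  asm volatile(
    "stc\n\t"                     // set CF
    "sha1msg1 %%xmm0, %%xmm1\n\t" // architecturally does not touch EFLAGS
    "mov $0, %%rax\n\t"
    "setc %%al"                   // RAX == 1 iff CF survived the op
    : "=a"(Result)
    :
    : "xmm1", "cc");
  printf("RAX: %lu (expect 1 when flags are preserved correctly)\n",
         (unsigned long)Result);
  return Result == 1 ? 0 : 1;
}
```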
When attempting to debug #3162 I noticed spurious behaviour around
what I assumed to be eflags getting corrupted around inlined syscalls.
This turned out to be a red herring, but to ensure we are still testing
this, create a fully fleshed-out unit test.
This test ensures a few things:
1) A flag that is set or unset before a syscall doesn't have its data
corrupted
2) An inline syscall doesn't corrupt the eflags, checking the eflags
result after returning from the syscall
3) A signal occurring while in an inline syscall returns the correct
eflags information in the signal handler information
The test accomplishes this by setting or unsetting a particular flag
and then calling the futex syscall in a way that is guaranteed to be
inlined and to wait forever. The parent thread then signals it with
SIGTERM and reads back the signal information. It does this multiple
times for each flag we care about; a sketch of the shape follows.
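A minimal sketch of the test's shape on x86-64 Linux (invented details:
SIGTERM delivery via pthread_kill and CF as the flag under test; the real
unit test is fuller and exercises each flag):
```cpp
#include <linux/futex.h>
#include <pthread.h>
#include <signal.h>
#include <sys/syscall.h>
#include <ucontext.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

static uint32_t FutexWord = 0;
static volatile uint64_t HandlerEFlags = 0;

static void Handler(int, siginfo_t*, void* Ctx) {
  // (3): the signal frame must report the eflags from inside the syscall
  HandlerEFlags = ((ucontext_t*)Ctx)->uc_mcontext.gregs[REG_EFL];
}

static void* Waiter(void*) {
  long Ret;
  uint64_t CF;
  register long Timeout asm("r10") = 0;  // NULL timeout: wait forever
  asm volatile(
    "stc\n\t"      // (1): set CF immediately before the syscall
    "syscall\n\t"  // futex(&FutexWord, FUTEX_WAIT, 0, NULL) blocks here
    "mov $0, %1\n\t"
    "setc %b1"     // (2): CF must survive the interrupted syscall
    : "=a"(Ret), "=r"(CF)
    : "0"(SYS_futex), "D"(&FutexWord), "S"(FUTEX_WAIT), "d"(0), "r"(Timeout)
    : "rcx", "r11", "memory");
  printf("CF after syscall: %lu (expect 1)\n", (unsigned long)CF);
  return nullptr;
}

int main() {
  struct sigaction SA {};
  SA.sa_sigaction = Handler;
  SA.sa_flags = SA_SIGINFO;
  sigaction(SIGTERM, &SA, nullptr);

  pthread_t Thread;
  pthread_create(&Thread, nullptr, Waiter, nullptr);
  sleep(1);  // crude: give the waiter time to block in the futex
  pthread_kill(Thread, SIGTERM);
  pthread_join(Thread, nullptr);
  printf("CF in handler: %llu (expect 1)\n",
         (unsigned long long)(HandlerEFlags & 1));  // CF is EFLAGS bit 0
  return 0;
}
```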
While tracking issues in #3162, I encountered a random crash that I
started hunting. It was very quickly apparent that this crash was
unrelated to that PR; I just happened to be running a unittest that was
creating and tearing down a bunch of threads, which exacerbated the
problem.
See the following strace output:
```
[pid 269497] munmap(0x7fffde1ff000, 16777216) = 0
[pid 269497] munmap(0x7fffde1ff000, 16777216 <unfinished ...>
[pid 268982] mmap(NULL, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffde1ff000
[pid 269497] <... munmap resumed>) = 0
```
One thread is freeing some memory with munmap, then another does an mmap and gets the same address back.
Nothing too crazy at first glance, but taking a closer look, we can
see two oddities:
1) We are double unmapping the same address range through munmap
2) The second munmap is interrupted and returns AFTER the mmap
This has the unfortunate side effect that the mmap that just returned
the same address has actually just been unmapped! This was resulting in
spurious crashes around thread creation that were SUPER hard to nail
down.
The problem comes down to how code buffer objects are managed, in
particular how the Arm64Emitter and Dispatcher handled their buffers.
Arm64Emitter is inherited by two classes: Dispatcher and Arm64JITCore.
On destruction the emitter would free its internal tracking buffer.
Additionally, on destruction the Arm64JITCore would walk through all of
its CodeBuffers and free them. The problem is that when the Arm64JITCore
freed its code buffers, one of them was also the current active buffer
bound to the Arm64Emitter, so the Arm64Emitter would come back around
and try to free the same buffer again.
This is a double-free problem, and it was only visible on thread exit!
We can't track double frees of mmap/munmap-backed memory with current tooling!
This problem typically didn't occur because destruction is usually fast,
and jemalloc sitting in between also typically means the problem doesn't
surface. I initially thought this was a threaded pool allocator bug,
because the new allocation would typically end up in there once a new
thread was spinning up.
Now we change the behaviour: Arm64Emitter doesn't do any buffer management
itself, instead just passing an initial buffer on to its internal buffer
tracking if given one up front.
This leaves the Dispatcher and the Arm64JITCore to do their own buffer
management and ensure there is no double free; a toy model of the new
ownership follows.
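A toy model of the new ownership (simplified names, not the real class
layout):
```cpp
#include <sys/mman.h>
#include <cstddef>
#include <vector>

struct CodeBuffer {
  void* Ptr;
  size_t Size;
};

// After the fix the emitter only *borrows* a buffer; it never frees one,
// so its destructor can no longer race the owner's cleanup.
struct Arm64Emitter {
  CodeBuffer* Current = nullptr;
  explicit Arm64Emitter(CodeBuffer* Initial = nullptr) : Current(Initial) {}
  void SetBuffer(CodeBuffer* Buffer) { Current = Buffer; }
  // No destructor that munmaps: ownership lives with the derived class.
};

struct Arm64JITCore : Arm64Emitter {
  std::vector<CodeBuffer> Buffers;  // sole owner of every buffer
  ~Arm64JITCore() {
    for (auto& Buffer : Buffers) {
      munmap(Buffer.Ptr, Buffer.Size);  // each buffer freed exactly once
    }
  }
};
```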
The day is saved!
This mask was being used incorrectly: it's a spill mask for host GPRs,
not an index into the SRA array. Search the array of SRA registers for
the first one in the mask to use as a temporary.
This fixes an issue with 32-bit inline syscalls where the first register
being spilled was r8, which was beyond the size of the SRA registers on
32-bit processes. This caused FEX to read the value just after
x32::SRA, which is x32::RA, meaning it would use r20 as a temporary,
corrupting that register in the process.
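A sketch of the corrected lookup with invented names (the real code operates
on FEX's SRA register tables):
```cpp
#include <cstdint>
#include <optional>
#include <span>

// The mask is a bitmask of *host* GPR numbers, so scan the SRA list for
// the first member whose host register is in the mask, rather than using
// a mask bit index to index the SRA array directly (which walked off the
// end of x32::SRA into x32::RA, handing out r20).
std::optional<uint32_t> FindSRATemp(uint32_t HostGPRSpillMask,
                                    std::span<const uint32_t> SRAHostRegs) {
  for (uint32_t Reg : SRAHostRegs) {
    if (HostGPRSpillMask & (1u << Reg)) {
      return Reg;  // first SRA register that is also in the spill mask
    }
  }
  return std::nullopt;  // no usable temporary in this mask
}
```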
I noticed this while poking at #3162, but also when I was looking at a
memory buffer ownership problem.