Six of the EFLAGS can't be read directly as a bitmask because they are
either stored in a different location or have multiple bits of state
backing them.
SF, ZF, CF, and OF are stored in ARM's NZCV format at offset 24.
The PF calculation is deferred, with its raw data stored at the regular offset.
AF is also deferred, relative to PF, but stored at the regular offset.
These flags /need/ to be reconstructed with the `ReconstructCompactedEFLAGS`
function when reading EFLAGS, and they /need/ to be set with
`SetFlagsFromCompactedEFLAGS` when writing it.
If either of these functions is bypassed when managing EFLAGS then the
internal representation will get mangled and the state will be
corrupted.
Having a little `_RAW` suffix on these to signify that they aren't just
regular single-bit representations like the other flags in EFLAGS should
make us pause over this issue before writing more broken code that
accesses them directly.
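To make the compacted layout concrete, here is a minimal reconstruction
sketch; the struct layout, field names, and the AF handling are simplified
assumptions for illustration, not the actual `ReconstructCompactedEFLAGS`
implementation:

```cpp
#include <bit>
#include <cstdint>

// Hypothetical compacted flag storage, for illustration only.
struct CompactedFlags {
  uint32_t NZCV;      // SF/ZF/CF/OF packed in ARM NZCV layout (bits 31..28).
  uint8_t PF_RAW;     // Deferred PF: raw result byte, parity computed on read.
  uint8_t AF_RAW;     // Deferred AF: simplified to a plain bit in this sketch.
  uint8_t Flags[32];  // Remaining EFLAGS stored one byte per flag.
};

// Sketch of reconstructing an architectural EFLAGS value for reads.
uint32_t ReconstructEFLAGSSketch(const CompactedFlags& F) {
  uint32_t EFLAGS = 0;

  // Map ARM NZCV (N=31, Z=30, C=29, V=28) back to x86 SF/ZF/CF/OF.
  EFLAGS |= ((F.NZCV >> 31) & 1) << 7;   // SF <- N
  EFLAGS |= ((F.NZCV >> 30) & 1) << 6;   // ZF <- Z
  EFLAGS |= ((F.NZCV >> 29) & 1) << 0;   // CF <- C
  EFLAGS |= ((F.NZCV >> 28) & 1) << 11;  // OF <- V

  // Deferred PF: x86 sets PF when the low result byte has an even number
  // of set bits.
  EFLAGS |= static_cast<uint32_t>(~std::popcount(static_cast<unsigned>(F.PF_RAW)) & 1) << 2;

  // AF is more involved in practice (it is stored relative to PF); treated
  // as a plain bit here to keep the sketch short.
  EFLAGS |= static_cast<uint32_t>(F.AF_RAW & 1) << 4;

  // The remaining flags are plain single-bit values and copy over directly;
  // only TF (bit 8), IF (bit 9), and DF (bit 10) are shown for brevity.
  for (int Bit = 8; Bit <= 10; ++Bit) {
    EFLAGS |= static_cast<uint32_t>(F.Flags[Bit] & 1) << Bit;
  }
  return EFLAGS;
}
```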
This allows us to use the reciprocal instructions, which match the precision
that x86 expects, rather than converting everything to float divides.
Currently no hardware supports this, and even the upcoming X4/A720/A520
won't support it, but it was trivial to implement so wire it up.
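For illustration, a hedged sketch of the two lowering strategies for
something like RCPPS using NEON intrinsics; the function name and the
capability flag are made up for the example:

```cpp
#include <arm_neon.h>

// Illustrative only: picks between the reciprocal-estimate path and the
// divide fallback depending on whether the estimate meets x86's precision.
float32x4_t RcppsLoweringSketch(float32x4_t Src, bool EstimateMatchesX86) {
  if (EstimateMatchesX86) {
    // A single FRECPE is enough when the hardware's estimate is at least as
    // precise as what x86 guarantees for RCPPS.
    return vrecpeq_f32(Src);
  }
  // Otherwise fall back to a full-precision divide of 1.0 by the source.
  return vdivq_f32(vdupq_n_f32(1.0f), Src);
}
```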
The motivation for only having a pointer array in CpuState was that
initialization was fairly cheap and that we have limited space inside
the encoding depending on what we want to do.
Initialization cost is still a concern, but doing a memcpy of 128 bytes
isn't that big of a deal.
Limited space in CpuState, while a concern, isn't a significant one:
- Needs to currently be less than 1 page in size.
- Needs to stay under the architectural limitation of load-store scaled
  offsets, which is 65KB for 128-bit vectors.
Still keeps the pointer array around for cases where we would need to
synthesize an address offset and it's just easier to load the
process-wide table.
The performance improvement here comes from removing the ldr+ldr
dependency chain. In microbenchmarks on Cortex-X1C, removing this
dependency chain has shown an improvement of ~4%.
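For illustration, a compile-time sanity check along these lines could
capture both constraints; the struct and member names are hypothetical
stand-ins for the real CpuState layout:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the real CpuState with the table embedded.
struct CpuStateSketch {
  uint64_t GPRs[16];
  uint8_t EmbeddedTable[128];  // Previously reached through a pointer load.
  // ... remaining state ...
};

// Must stay under a page so the whole state fits in one page-sized mapping.
static_assert(sizeof(CpuStateSketch) < 4096, "CpuState must stay below one page");

// LDR (unsigned scaled immediate) for 128-bit vectors encodes imm12 * 16,
// so any member accessed that way must sit below 4095 * 16 bytes.
static_assert(offsetof(CpuStateSketch, EmbeddedTable) <= 4095u * 16u,
              "Embedded table must be reachable with a scaled 128-bit offset");
```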
It is scarcely used today, and like the x86 JIT, it is a significant
maintenance burden complicating work on FEXCore and arm64 optimization. Remove
it, bringing us down to 2 backends.
1 down, 1 to go.
Some interpreter scaffolding remains for x87 fallbacks. That is not a problem
here.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
While this interface is usually pretty fast because it is a
write-and-forget operation, it has issues when multiple threads hit the
perf map file at the same time. In particular this interface becomes a
bottleneck due to a mutex taken on writes in the kernel.
This bottleneck occurs when a bunch of threads get spawned and they are
all jitting code as quickly as possible. Geekbench's clang benchmark
hits this hard: all eight CPU threads spend ~40% of their CPU time
stalled waiting for this mutex to unlock.
To work around this issue, buffer the writes a small amount: either up
to a page-ish of data or 100ms of time. This completely eliminates
threads waiting on the kernel mutex.
- Around a page of buffer space was chosen by profiling Geekbench's
clang benchmark and seeing how frequently it was still writing.
- 1024 bytes was still fairly aggressive, 4096 seemed fine.
- 100ms was chosen to ensure we don't wait /too/ long to write JIT
symbols.
- In most cases 100ms is enough that you won't notice the blip in
perf.
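Roughly, the buffering scheme looks like the sketch below, with thresholds
taken from the numbers above; the class and member names are invented for
the example rather than taken from the actual implementation:

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <unistd.h>

// Illustrative buffered writer for a /tmp/perf-<pid>.map style file.
// Flushes when roughly a page of text has accumulated or 100ms has passed,
// so threads rarely contend on the kernel-side write path.
class BufferedPerfMapWriterSketch {
public:
  explicit BufferedPerfMapWriterSketch(int FD) : FD_{FD} {}

  void Append(const std::string& Line) {
    Buffer_ += Line;
    const auto Now = std::chrono::steady_clock::now();
    if (Buffer_.size() >= FlushBytes || Now - LastFlush_ >= FlushInterval) {
      Flush(Now);
    }
  }

private:
  void Flush(std::chrono::steady_clock::time_point Now) {
    if (!Buffer_.empty()) {
      // One write syscall per flush instead of one per JIT block.
      (void)::write(FD_, Buffer_.data(), Buffer_.size());
      Buffer_.clear();
    }
    LastFlush_ = Now;
  }

  static constexpr std::size_t FlushBytes = 4096;
  static constexpr auto FlushInterval = std::chrono::milliseconds{100};

  int FD_;
  std::string Buffer_;
  std::chrono::steady_clock::time_point LastFlush_{std::chrono::steady_clock::now()};
};
```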
One thing of note is that with profiling enabled, checking the time on
every JIT block still ends up costing 2-3% CPU time in vDSO
clock_gettime. We could improve this by using the cycle counter directly
since that is still guaranteed to be monotonic. Maybe we'll come back to
that if it turns out to actually be an issue here.
Changes the helper which all the source uses so that it still calculates
the size implicitly. Converting all the implicit uses over to the
explicit operation is going to take a while; this gets us started by at
least making the IR operation itself explicit.
When the source registers are sequential this turns into a load of the
vector constant (2 instructions) plus the single tbl instruction.
If the registers aren't sequential then this turns into 2 moves and
then the single tbl, which with zero-cycle rename isn't too bad.
Even as a worst-case option this is significantly better than the
previous implementation, which did a bunch of inserts and was always 9
instructions.
We should still strive to implement faster versions without the use of
TBL2 where possible, but this makes it less of a concern.
Skips implementing it for the x86 JIT because that's a bit of a
nightmare to think about.
The ARM64 implementation requires sequential registers, which means if
the incoming sources aren't sequential then we need to move them into
the two vector temporaries. This is fine since we have zero-cycle
vector renames and the alternative is slower.
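For reference, the generated TBL2 corresponds to the two-table NEON lookup
sketched below; this is just an intrinsics-level illustration of the
semantics, not the JIT's emitter code:

```cpp
#include <arm_neon.h>

// Two-register table lookup: each index byte selects from the 32 bytes
// formed by the pair {Lo, Hi}; out-of-range indices produce zero.
// The tbl instruction requires Lo and Hi to live in sequential registers,
// which is why the JIT may need two moves first.
uint8x16_t Tbl2Sketch(uint8x16_t Lo, uint8x16_t Hi, uint8x16_t Indices) {
  uint8x16x2_t Table = {{Lo, Hi}};
  return vqtbl2q_u8(Table, Indices);
}
```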
This wasn't implemented initially for the interpreter and x86 JIT, which
meant we were maintaining two codepaths. Implement these operations in
the interpreter and x86 JIT so we no longer need to do that.
The emitted code in the x86 JIT is hot garbage, but it's only needed for
correctness testing there, not performance testing.
It turns out that pure SSA isn't a great choice for the sort of emulation we do.
On one hand, it discards information from the guest binary's register allocation
that would let us skip stuff. On the other hand, it doesn't have nearly as many
benefits in this setting as in a traditional compiler... We really *don't* want
to do global RA or really any global optimization. We assume the guest optimizer
did its job for x86; we just need to clean up the mess left from going x86 ->
arm. So we just need enough SSA to peephole optimize.
My concrete IR proposals are that:
* SSA values must be killed in the same block that they are defined.
* Explicit LoadGPR/StoreGPR instructions can be used for global persistence.
* LoadGPR/StoreGPR are eliminated in favour of SSA within a block.
This has a lot of nice properties for our setting:
* Except for some internal REP instruction emulation (etc), we already have
  registers for everything that escapes block boundaries, so this form is very
  easy to go into -- straightforward local value numbering, not a full into-SSA
  pass.
* Spilling is entirely local (if it happens at all), since everything is in
registers at block boundaries. This is excellent, because Belady's algorithm
lets us spill nearly optimally in linear-time for individual blocks. (And
the global version of Belady's algorithm is massively more complicated...)
A nice fit for a JIT.
Relatedly, it turns out allowing spilling is probably a decent decision,
since the same spiller code can be used to rematerialize constants in a
straightforward way. This is an issue with the current RA.
* Register assignment is entirely local. For the same reason, we can assign
  registers "optimally" in linear time & memory (e.g. with linear scan; see the
  sketch after this list). And the impl is massively simpler than a full blown
  SSA-based tree scan RA. For example, we don't have to worry about parallel
  copies or coalescing phis or anything. Massively nicer algorithm to deal with.
* SSA value names can be block local which makes the validation implicit :~)
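To make the "linear time & memory" point concrete, here is a minimal
block-local linear scan sketch; the interval representation and register
pool are invented for the example and this is not FEX's actual RA:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// A value's live range within a single block: defined at Start, dead after End.
struct Interval {
  uint32_t Start, End;
  std::optional<uint8_t> Reg;  // Assigned register, if any.
};

// Block-local linear scan over intervals sorted by Start.
// With every value killed inside its defining block, one pass suffices.
void LinearScanSketch(std::vector<Interval>& Intervals, uint8_t NumRegs) {
  std::vector<Interval*> Active;  // Currently live, holding a register.
  std::vector<uint8_t> FreeRegs;
  for (uint8_t R = 0; R < NumRegs; ++R) FreeRegs.push_back(R);

  for (auto& I : Intervals) {
    // Expire intervals that died before this one starts, freeing their registers.
    for (auto It = Active.begin(); It != Active.end();) {
      if ((*It)->End < I.Start) {
        FreeRegs.push_back(*(*It)->Reg);
        It = Active.erase(It);
      } else {
        ++It;
      }
    }

    if (FreeRegs.empty()) {
      // No register free: a real implementation would spill here, which stays
      // block-local as well (this is where Belady-style choices come in).
      continue;
    }

    I.Reg = FreeRegs.back();
    FreeRegs.pop_back();
    Active.push_back(&I);
  }
}
```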
It also has remarkably few drawbacks, because we didn't want to do CFG global
optimization anyway given our time budget and the diminishing returns. The few
global optimizations we might want (flag escape analysis?) don't necessarily
benefit from pure SSA anyway.
Anyway, we explicitly don't want phi nodes in any of this. They're currently
unused. Let's just remove them so nobody gets the bright idea of changing that.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
The FXSAVE and FSAVE tag words are written out in different formats,
with FXSAVE using an abridged version that lacks the zero/special/valid
distinction. Switch to using this abridged version internally for
simplicity, and to allow the zero/special/valid calculation to be
deferred until a full tag word is actually needed (in the future;
currently the distinction is ignored and only valid/empty states are
possible).
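For illustration, expanding the abridged tag word into the full FSAVE form
while ignoring the zero/special distinction looks roughly like this (names
invented for the example):

```cpp
#include <cstdint>

// Abridged (FXSAVE) tag word: one bit per x87 register, 1 = valid, 0 = empty.
// Full (FSAVE) tag word: two bits per register:
//   00 = valid, 01 = zero, 10 = special, 11 = empty.
// This sketch ignores the zero/special distinction, matching the current
// behaviour where only valid/empty are tracked.
uint16_t ExpandAbridgedTagWordSketch(uint8_t Abridged) {
  uint16_t Full = 0;
  for (int Reg = 0; Reg < 8; ++Reg) {
    const bool Valid = (Abridged >> Reg) & 1;
    const uint16_t TwoBit = Valid ? 0b00 : 0b11;
    Full |= TwoBit << (Reg * 2);
  }
  return Full;
}
```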
Currently FEX's internal EFLAGS representation is a perfect 1:1 mapping
between bit offset and byte offset. This is going to change with #3038.
There is no reason the frontend needs to understand how to reconstruct
the compacted flags from the internal representation.
Adds context helpers and moves all the logic to FEXCore. The locations
that previously needed to handle this have been converted over to use
the helpers.
We can load the swizzle table from our constant pool now. This removes
the only usage of VTMP3 from our Arm64 JIT.
I would say this is now optimal for the version without RCON set.
With RCON we could technically make some of the constant moves more
optimal.
Saw a few locations in here where we operate at 64-bit unconditionally
around pointer calculations. Will be coming back to those for 32-bit
mode.
This is the last of the implicitly sized ALU operations! After this I'll
be going through the IR ops more individually to try and remove any
stragglers.
Then we should be able to start cleaning up and actually optimizing GPR
operations.