mirror of https://github.com/FEX-Emu/FEX.git synced 2025-03-06 21:47:32 +00:00

FEXCore/docs: Adds programmer documentation about memory model emulation

I keep needing to look these up to remember the limitations. Add a doc
file so I can more easily point to the information.

2024-05-31 10:36:48 -07:00

5.2 KiB

Raw Blame History

What is x86-TSO and what is different compared to ARM's weak memory model?

x86's memory model is a very strictly coherent memory model that effectively mandates that all memory accesses are "atomic". While atomicity is actually a bit more strict, we actually need to emulate it in ARM using atomic instructions. We are also required to emulate this strictness with unaligned accesses, which is due to x86 CPUs allowing unaligned atomics for "free" within a cacheline. Intel also takes this a step more and allowing full atomics with a feature called "split-locks", AMD gains this same feature in Zen 5.

Emulating loads

Due to x86 SIB addressing, this can happen on most instructions. FEX emulates these in a variety of ways depending on features. Most instructions are emulated with an atomic instruction but we also implement a feature called "half-barrier" atomics for unaligned atomics.

Base ARMv8.0

Addressing limitations
- Register only This is emulated using an atomic load instruction plus a nop.
On unaligned access the code gets backpatched to a non-atomic load plus a memory barrier

FEAT_LRCPC

Addressing limitations
- Register only This matches the base ARMv8.0 implementation but adds new instructions that match x86-TSO behaviour, making the emulation slightly quicker.
On unaligned access it still gets backpatched to non-atomic load plus a memory barrier.

FEAT_LRCPC2

Addressing limitations
- Register plus 9-bit signed immediate (-256, 255) Adds some new instructions that allow immediate encoding inside of the previous LRCPC instructions

FEAT_LRCPC3

Adds a handful of GPR instructions that aren't super interesting

FEX doesn't currently implement these since no hardware supports it.

ldapr - Post-index load for stack
ldiapr - Post-index load pair for stack
stilp - pre-index store pair for stack
stlr - pre-index store for stack

Emulating stores

Again due to x86 SIB addressing, this can also happen on most instructions. There are less options for FEX with this extension, so in most cases this just turns in to an atomic store with half-barrier backpatching for unaligned accesses

FEAT_LRCPC, FEAT_LRCPC2

Adds nothing for emulating stores

FEAT_LRCPC3

Emulating atomic instructions

x86 has atomic memory operations that can do a variety of operations. For unaligned atomic operations FEX will emulate the operation inside the signal handler if it happens to be unaligned.

CASPair - cmpxchg

Base ARMv8.0

Addressing limitations
- Register only This is emulated with a ldaxp+stlxp pair of instructions.

FEAT_LSE

Addressing limitations
- Register only Adds a new caspal instruction that does the operation almost exactly like x86.

CAS - cmpxchg8b/cmpxchg16b

Base ARMv8.0

Addressing limitations
- Register only Similar to CASPair but now only uses a ldaxr+stlxr pair

FEAT_LSE

Addressing limitations
- Register only Similar to CASPair adds a new casal instruction that operates basically like x86

AtomicFetch

Op from the following list

Add
Sub
And
CLR
Or
Xor
Neg
Swap

Base ARMv8.0

Addressing limitations
- Register only All operations get emulated with an ldaxr+stlxr+ instruction

FEAT_LSE

Addressing limitations
- Register only Almost all operations now have a native atomic memory operation instruction. The only outlier is atomicNeg which doesn't have an LSE equivalent and uses the ARMv8.0 implementation.

Vector loads

Since almost all memory accesses on x86 are TSO, this includes vector operations.

Base ARMv8.0

Addressing limitations
- Register plus 9-bit signed immediate (-256, 255)
- Register plus 12-bit unsigned scaled immediate (Scaled by access size) Emulated using half-barriers, which means a load+dmb

FEAT_LRCPC3

LDAP1 added for element loads. Register only address encoding
LDAPUR added for vector register loads, supports 9-bit simm offset

Vector stores

Just like loads, these are emulated using half-barriers

Base ARMv8.0

Addressing limitations
- Register plus 9-bit signed immediate (-256, 255)
- Register plus 12-bit unsigned scaled immediate (Scaled by access size) Emulated using half-barriers, which means a dmb+str

5.2 KiB

Raw Blame History

What is x86-TSO and what is different compared to ARM's weak memory model?

Emulating loads

Base ARMv8.0

FEAT_LRCPC

FEAT_LRCPC2

FEAT_LRCPC3

Emulating stores

FEAT_LRCPC, FEAT_LRCPC2

FEAT_LRCPC3

Emulating atomic instructions

CASPair - cmpxchg

Base ARMv8.0

FEAT_LSE

CAS - cmpxchg8b/cmpxchg16b

Base ARMv8.0

FEAT_LSE

AtomicFetch

Op from the following list

Base ARMv8.0

FEAT_LSE

Vector loads

Base ARMv8.0

FEAT_LRCPC3

Vector stores

Base ARMv8.0

FEAT_LRCPC3

Addressing limitations depending on operating mode

GPR loadstores

TSO Emulation disabled

TSO Emulation enabled

Vector loadstores

TSO Emulation disabled

TSO Emulation enabled

TSO Emulation enabled (FEAT_LRCPC3)

Atomic memory operations

5.2 KiB Raw Blame History

What is x86-TSO and what is different compared to ARM's weak memory model?

Emulating loads

Base ARMv8.0

FEAT_LRCPC

FEAT_LRCPC2

FEAT_LRCPC3

Emulating stores

FEAT_LRCPC, FEAT_LRCPC2

FEAT_LRCPC3

Emulating atomic instructions

CASPair - cmpxchg

Base ARMv8.0

FEAT_LSE

CAS - cmpxchg8b/cmpxchg16b

Base ARMv8.0

FEAT_LSE

AtomicFetch

Op from the following list

Base ARMv8.0

FEAT_LSE

Vector loads

Base ARMv8.0

FEAT_LRCPC3

Vector stores

Base ARMv8.0

FEAT_LRCPC3

Addressing limitations depending on operating mode

GPR loadstores

TSO Emulation disabled

TSO Emulation enabled

Vector loadstores

TSO Emulation disabled

TSO Emulation enabled

TSO Emulation enabled (FEAT_LRCPC3)

Atomic memory operations

5.2 KiB

Raw Blame History