diff --git a/docs/Atomics.html b/docs/Atomics.html
new file mode 100644
index 00000000000..92065a9b45e
--- /dev/null
+++ b/docs/Atomics.html
@@ -0,0 +1,295 @@
+Historically, LLVM has not had very strong support for concurrency; some
+minimal intrinsics were provided, and volatile
was used in some
+cases to achieve rough semantics in the presence of concurrency. However, this
+is changing; there are now new instructions which are well-defined in the
+presence of threads and asynchronous signals, and the model for existing
+instructions has been clarified in the IR.
The atomic instructions are designed specifically to provide readable IR and
+ optimized code generation for the following:
+
+ - The new C++0x <atomic> header.
+ - volatile and regular shared variables.
+ - __sync_* builtins.
+ - static variables with non-trivial constructors in C++.
+
+ This document is intended to provide anyone writing a frontend for LLVM, or
+ working on optimization passes for LLVM, with a guide for how to deal with
+ instructions with special semantics in the presence of concurrency. This is
+ not intended to be a precise guide to the semantics; the details can get
+ extremely complicated and unreadable, and are not usually necessary.
The basic 'load' and 'store' instructions allow a variety of
+ optimizations, but can have unintuitive results in a concurrent environment.
+ For a frontend writer, the rule is essentially that all memory accessed
+ with basic loads and stores by multiple threads should be protected by a
+ lock or other synchronization; otherwise, you are likely to run into
+ undefined behavior. (Do not use volatile as a substitute for atomics; it
+ might work on some platforms, but does not provide the necessary guarantees
+ in general.)
From the optimizer's point of view, the rule is that if there
+ are not any instructions with atomic ordering involved, concurrency does not
+ matter, with one exception: if a variable might be visible to another
+ thread or signal handler, a store cannot be inserted along a path where it
+ might not execute otherwise. Note that speculative loads are allowed;
+ a load which is part of a race returns undef, but is not
+ undefined behavior.
For cases where simple loads and stores are not sufficient, LLVM provides + atomic loads and stores with varying levels of guarantees.
+ +In order to achieve a balance between performance and necessary guarantees, + there are six levels of atomicity. They are listed in order of strength; + each level includes all the guarantees of the previous level except for + Acquire/Release.
+ +Unordered is the lowest level of atomicity. It essentially guarantees that + races produce somewhat sane results instead of having undefined behavior. + This is intended to match the Java memory model for shared variables. It + cannot be used for synchronization, but is useful for Java and other + "safe" languages which need to guarantee that the generated code never + exhibits undefined behavior. Note that this guarantee is cheap on common + platforms for loads of a native width, but can be expensive or unavailable + for wider loads, like a 64-bit load on ARM. (A frontend for a "safe" + language would normally split a 64-bit load on ARM into two 32-bit + unordered loads.) In terms of the optimizer, this prohibits any + transformation that transforms a single load into multiple loads, + transforms a store into multiple stores, narrows a store, or stores a + value which would not be stored otherwise. Some examples of unsafe + optimizations are narrowing an assignment into a bitfield, rematerializing + a load, and turning loads and stores into a memcpy call. Reordering + unordered operations is safe, though, and optimizers should take + advantage of that because unordered operations are common in + languages that need them.
+ +Monotonic is the weakest level of atomicity that can be used in
+ synchronization primitives, although it does not provide any general
+ synchronization. It essentially guarantees that if you take all the
+ operations affecting a specific address, a consistent ordering exists.
+ This corresponds to the C++0x/C1x memory_order_relaxed; see
+ those standards for the exact definition. If you are writing a frontend, do
+ not use the low-level synchronization primitives unless you are compiling
+ a language which requires it or are sure a given pattern is correct. In
+ terms of the optimizer, this can be treated as a read+write on the relevant
+ memory location (and alias analysis will take advantage of that). In
+ addition, it is legal to reorder non-atomic and Unordered loads around
+ Monotonic loads. CSE/DSE and a few other optimizations are allowed, but
+ Monotonic operations are unlikely to be used in ways which would make
+ those optimizations useful.
Acquire provides a barrier of the sort necessary to acquire a lock to access
+ other memory with normal loads and stores. This corresponds to the
+ C++0x/C1x memory_order_acquire. This is a low-level
+ synchronization primitive. In general, optimizers should treat this like
+ a nothrow call.
Release is similar to Acquire, but with a barrier of the sort necessary to
+ release a lock. This corresponds to the C++0x/C1x
+ memory_order_release.
AcquireRelease (acq_rel in IR) provides both an Acquire and a Release barrier.
+ This corresponds to the C++0x/C1x memory_order_acq_rel. In general,
+ optimizers should treat this like a nothrow call.
SequentiallyConsistent (seq_cst in IR) provides Acquire semantics for
+ loads and Release semantics for stores, and in addition guarantees a total
+ ordering exists with all other SequentiallyConsistent operations. This
+ corresponds to the C++0x/C1x memory_order_seq_cst, and Java volatile. The intent
+ of this ordering level is to provide a programming model which is relatively
+ easy to understand. In general, optimizers should treat this like a
+ nothrow call.
cmpxchg and atomicrmw are essentially like an
+ atomic load followed by an atomic store (where the store is conditional for
+ cmpxchg), but no other memory operation can happen
+ between the load and store.
A fence provides Acquire and/or Release ordering which is not
+ part of another operation; it is normally used along with Monotonic memory
+ operations. A Monotonic load followed by an Acquire fence is roughly
+ equivalent to an Acquire load.
Frontends generating atomic instructions generally need to be aware of the + target to some degree; atomic instructions are guaranteed to be lock-free, + and therefore an instruction which is wider than the target natively supports + can be impossible to generate.
Predicates for optimizer writers to query:
+
+ - isSimple(): a load or store which is neither volatile nor atomic; most
+   transformations should check this before touching a memory operation.
+ - isUnordered(): a load or store which is not volatile and is at most
+   Unordered.
+
+ There are essentially two components to supporting atomic operations. The
+ first is making sure to query isSimple() or isUnordered() instead
+ of isVolatile() before transforming an operation. The other piece is
+ making sure that a transform does not end up replacing, for example, an
+ Unordered operation with a non-atomic operation. Most of the other
+ necessary checks automatically fall out from existing predicates and
+ alias analysis queries.
How optimizations interact with the various kinds of atomic operations
+ follows from the guarantees above: the stronger an operation's ordering,
+ the fewer transformations are legal on it.
Atomic operations are represented in the SelectionDAG with
+ ATOMIC_* opcodes. On architectures which use barrier
+ instructions for all atomic ordering (like ARM), appropriate fences are
+ split out as the DAG is built.
The MachineMemOperand for all atomic operations is currently marked as + volatile; this is not correct in the IR sense of volatile, but CodeGen + handles anything marked volatile very conservatively. This should get + fixed at some point.
+ +The implementation of atomics on LL/SC architectures (like ARM) is currently + a bit of a mess; there is a lot of copy-pasted code across targets, and + the representation is relatively unsuited to optimization (it would be nice + to be able to optimize loops involving cmpxchg etc.).
On x86, all atomic loads generate a MOV.
+ SequentiallyConsistent stores generate an XCHG, other stores
+ generate a MOV. SequentiallyConsistent fences generate an
+ MFENCE, other fences do not cause any code to be generated.
+ cmpxchg uses the LOCK CMPXCHG instruction.
+ atomicrmw xchg uses XCHG,
+ atomicrmw add and atomicrmw sub use
+ XADD, and all other atomicrmw operations generate
+ a loop with LOCK CMPXCHG. Depending on the users of the
+ result, some atomicrmw operations can be translated into
+ operations like LOCK AND, but that does not work in
+ general.
On ARM, MIPS, and many other RISC architectures, Acquire, Release, and
+ SequentiallyConsistent semantics require barrier instructions
+ for every such operation. Loads and stores generate normal instructions.
+ atomicrmw and cmpxchg generate LL/SC loops.