Memory Model Relaxation Annotations¶
Introduction¶
Memory Model Relaxation Annotations (MMRAs) are target-defined properties on instructions that can be used to selectively relax constraints placed by the memory model. For example:
- The use of VulkanMemoryModel in a SPIRV program allows certain memory operations to be reordered across acquire or release operations.
- OpenCL APIs expose primitives to only fence a specific set of address spaces. Carrying that information to the backend can enable the use of faster synchronization instructions, rather than fencing all address spaces every time.
MMRAs offer an opt-in system for targets to relax the default LLVM memory model. As such, they are attached to an operation using LLVM metadata which can always be dropped without affecting correctness.
Definitions¶
- memory operation
A load, a store, an atomic, or a function call that is marked as accessing memory.
- synchronizing operation
An instruction that synchronizes memory with other threads (e.g. an atomic or a fence).
- tag
Metadata attached to a memory or synchronizing operation that represents some target-defined property regarding memory synchronization.
An operation may have multiple tags that each represent a different property.
A tag is composed of a pair of metadata strings: a prefix and a suffix.
In LLVM IR, the pair is represented using a metadata tuple. In other cases (comments, documentation, etc.), we may use the prefix:suffix notation. For example:
!0 = !{!"scope", !"workgroup"}  # scope:workgroup
!1 = !{!"scope", !"device"}     # scope:device
!2 = !{!"scope", !"system"}     # scope:system
Note
The only semantics relevant to the optimizer is the “compatibility” relation defined below. All other semantics are target defined.
Tags can also be organised in lists to allow operations to specify all of the tags they belong to. Such a list is referred to as a “set of tags”.
!0 = !{!"scope", !"workgroup"} !1 = !{!"sync-as", !"private"} !2 = !{!0, !2}
Note
If an operation does not have MMRA metadata, it is treated as if it has an empty list (!{}) of tags.
Note that it is not an error if a tag is not recognized by the instruction it is applied to, or by the current target. Such tags are simply ignored.
Both synchronizing operations and memory operations can have zero or more tags attached to them using the !mmra syntax.
For the sake of readability in the examples below, we use a (non-functional) short syntax to represent MMRA metadata:
store %ptr1 # foo:bar
store %ptr1 !mmra !{!"foo", !"bar"}
These two notations can be used in this document and are strictly equivalent. However, only the second version is functional.
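For instance, the functional form of attaching a set of two tags to a store might look like the following sketch; the function name, stored value, and tag names here are illustrative, not tags defined by any particular target.
define void @tagged_store(ptr %ptr1) {
  ; This store carries the set of tags { foo:bar, foo:baz }.
  store i32 0, ptr %ptr1, align 4, !mmra !2
  ret void
}

!0 = !{!"foo", !"bar"}
!1 = !{!"foo", !"baz"}
!2 = !{!0, !1}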
- compatibility
Two sets of tags are said to be compatible iff, for every unique tag prefix P present in at least one set:
- the other set contains no tag with prefix P, or
- at least one tag with prefix P is common to both sets.
The above definition implies that an empty set is always compatible with any other set. This is an important property as it ensures that if a transform drops the metadata on an operation, it can never affect correctness. In other words, the memory model cannot be relaxed further by deleting metadata from instructions.
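For illustration, consider the following tag sets built from hypothetical scope and sync-as tags (placeholder names, not tags defined by any particular target):
!0 = !{!"scope", !"workgroup"}   # scope:workgroup
!1 = !{!"scope", !"wavefront"}   # scope:wavefront
!2 = !{!"sync-as", !"local"}     # sync-as:local
!3 = !{!0, !2}                   # { scope:workgroup, sync-as:local }
!4 = !{!1}                       # { scope:wavefront }
Here, !3 and !4 are not compatible: both contain a tag with the scope prefix, but no scope tag is common to the two sets. A set containing only sync-as:local (such as !2 used on its own) is compatible with both !3 and !4, because scope is absent from it. An operation with no MMRA metadata (the empty set) is compatible with everything.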
The happens-before Relation¶
Compatibility checks can be used to opt out of the happens-before relation established between two instructions.
- Ordering
When two instructions’ metadata are not compatible, any program order between them is not in happens-before.
For example, consider two tags foo:bar and foo:baz exposed by a target:
A: store %ptr1                  # foo:bar
B: store %ptr2                  # foo:baz
X: store atomic release %ptr3   # foo:bar
In the above figure, A is compatible with X, and hence A happens-before X. But B is not compatible with X, and hence it is not happens-before X.
- Synchronization
If a synchronizing operation has one or more tags, then whether it synchronizes-with and participates in the seq_cst order with other operations is target dependent.
Whether the following example synchronizes with another sequence depends on the target-defined semantics of foo:bar and foo:bux.
fence release        # foo:bar
store atomic %ptr1   # foo:bux
Examples¶
- Example 1:
A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
B: store atomic release ptr addrspace(1) %ptr3  # sync-as:0 vulkan:nonprivate
A and B are not ordered relative to each other (no happens-before) because their sets of tags are not compatible.
Note that the sync-as value does not have to match the addrspace value. For example, in Example 1, a store-release to a location in addrspace(1) wants to only synchronize with operations happening in addrspace(0).
- Example 2:
A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
B: store atomic release ptr addrspace(1) %ptr3  # sync-as:1 vulkan:nonprivate
The ordering of A and B is unaffected because their sets of tags are compatible.
Note that A and B may or may not be in happens-before due to other reasons.
- Example 3:
A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
B: store atomic release ptr addrspace(1) %ptr3  # vulkan:nonprivate
The ordering of A and B is unaffected because their sets of tags are compatible.
- Example 4:
A: store ptr addrspace(1) %ptr2                 # sync-as:1
B: store atomic release ptr addrspace(1) %ptr3  # sync-as:2
A and B do not have to be ordered relative to each other (no happens-before) because their sets of tags are not compatible.
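As a sketch, Example 4 written in the functional !mmra form could look like this (the function name, stored values, and alignment are illustrative):
define void @example4(ptr addrspace(1) %ptr2, ptr addrspace(1) %ptr3) {
  store i32 0, ptr addrspace(1) %ptr2, align 4, !mmra !0                  ; A: sync-as:1
  store atomic i32 0, ptr addrspace(1) %ptr3 release, align 4, !mmra !1   ; B: sync-as:2
  ret void
}

!0 = !{!"sync-as", !"1"}
!1 = !{!"sync-as", !"2"}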
Use-cases¶
SPIRV NonPrivatePointer¶
MMRAs can support the SPIRV capability
VulkanMemoryModel
, where synchronizing operations only affect
memory operations that specify NonPrivatePointer
semantics.
The example below is generated from a SPIRV program using the following recipe:
- Add vulkan:nonprivate to every synchronizing operation.
- Add vulkan:nonprivate to every non-atomic memory operation that is marked NonPrivatePointer.
- Add vulkan:private to every non-atomic memory operation that is not marked NonPrivatePointer.
Thread T1:
A: store %ptr1 # vulkan:nonprivate
B: store %ptr2 # vulkan:private
X: store atomic release %ptr3 # vulkan:nonprivate
Thread T2:
Y: load atomic acquire %ptr3 # vulkan:nonprivate
C: load %ptr2 # vulkan:private
D: load %ptr1 # vulkan:nonprivate
Compatibility ensures that operation A
is ordered
relative to X
while operation D
is ordered relative to Y
.
If X
synchronizes with Y
, then A
happens-before D
.
No such relation can be inferred about operations B
and C
.
Note
The Vulkan Memory Model considers all atomic operations non-private.
Whether vulkan:nonprivate
would be specified on atomic operations is
an implementation detail, as an atomic operation is always nonprivate
.
The implementation may choose to be explicit and emit IR with
vulkan:nonprivate
on every atomic operation, or it could choose to
only emit vulkan:private
and assume vulkan:nonprivate
by default.
Operations marked with vulkan:private
effectively opt out of the
happens-before order in a SPIRV program since they are incompatible
with every synchronizing operation. Note that SPIRV operations that
are not marked NonPrivatePointer
are not entirely private to the
thread — they are implicitly synchronized at the start or end of a
thread by the Vulkan system-synchronizes-with relationship. This
example assumes that the target-defined semantics of
vulkan:private
correctly implements this property.
This scheme is general enough to express the interoperability of SPIRV programs with other environments.
Thread T1:
A: store %ptr1 # vulkan:nonprivate
X: store atomic release %ptr2 # vulkan:nonprivate
Thread T2:
Y: load atomic acquire %ptr2 # foo:bar
B: load %ptr1
In the above example, thread T1
originates from a SPIRV program
while thread T2
originates from a non-SPIRV program. Whether X
can synchronize with Y
is target defined. If X
synchronizes
with Y
, then A
happens-before B
(because A/X and
Y/B are compatible).
Implementation Example¶
Consider the implementation of SPIRV NonPrivatePointer
on a target
where all memory operations are cached, and the entire cache is
flushed or invalidated at a release
or acquire
respectively. A
possible scheme is that when translating a SPIRV program, memory
operations marked NonPrivatePointer
should not be cached, and the
cache contents should not be touched during an acquire
and
release
operation.
This could be implemented using the tags that share the vulkan:
prefix,
as follows:
For memory operations:
- Operations with vulkan:nonprivate should bypass the cache.
- Operations with vulkan:private should be cached.
- Operations that specify neither or both should conservatively bypass the cache to ensure correctness.
For synchronizing operations:
- Operations with vulkan:nonprivate should not flush or invalidate the cache.
- Operations with vulkan:private should flush or invalidate the cache.
- Operations that specify neither or both should conservatively flush or invalidate the cache to ensure correctness.
Note
In such an implementation, dropping the metadata on an operation, while not affecting correctness, may have significant performance implications, e.g. an operation bypassing the cache when it shouldn’t.
Memory Types¶
MMRAs may express the selective synchronization of different memory types.
As an example, a target may expose a sync-as:<N>
tag to
pass information about which address spaces are synchronized by the
execution of a synchronizing operation.
Note
Address spaces are used here as a common example, but this concept can apply for other “memory types”. What “memory types” means here is up to the target.
# let 1 = global address space
# let 3 = local address space
Thread T1:
A: store %ptr1 # sync-as:1
B: store %ptr2 # sync-as:3
X: store atomic release ptr addrspace(0) %ptr3 # sync-as:3
Thread T2:
Y: load atomic acquire ptr addrspace(0) %ptr3 # sync-as:3
C: load %ptr2 # sync-as:3
D: load %ptr1 # sync-as:1
In the above figure, X
and Y
are atomic operations on a
location in the global
address space. If X
synchronizes with
Y
, then B
happens-before C
in the local
address
space. But no such statement can be made about operations A
and
D
, although they are performed on a location in the global
address space.
Implementation Example: Adding Address Space Information to Fences¶
Languages such as OpenCL C provide fence operations such as
atomic_work_item_fence
that can take an explicit address
space to fence.
By default, LLVM has no means to carry that information in the IR, so the information is lost during lowering to LLVM IR. This means that targets such as AMDGPU have to conservatively emit instructions to fence all address spaces in all cases, which can have a noticeable performance impact in high-performance applications.
MMRAs may be used to preserve that information at the IR level, all the
way through code generation. For example, a fence that only affects the
global address space addrspace(1)
may be lowered as
fence release # sync-as:1
and the target may use the presence of sync-as:1
to infer that it
must only emit instructions to fence the global address space.
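A functional IR sketch of such a fence could look like this (the function name is illustrative, and the meaning of the "1" suffix is target defined):
define void @release_fence_global() {
  fence release, !mmra !0   ; intended to only fence the global address space
  ret void
}

!0 = !{!"sync-as", !"1"}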
Note that as MMRAs are opt-in, a fence that does not have MMRA metadata could still be lowered conservatively, so this optimization would only apply if the front-end emits the MMRA metadata on the fence instructions.
Additional Topics¶
Note
The following sections are informational.
Performance Impact¶
MMRAs are a way to capture optimization opportunities in the program. But when an operation mentions no tags or conflicting tags, the target may need to produce conservative code to ensure correctness at the cost of performance. This can happen in the following situations:
- When a target first introduces MMRAs, the frontend might not have been updated to emit them.
- An optimization may drop MMRA metadata.
- An optimization may add arbitrary tags to an operation.
Note that targets can always choose to ignore (or even drop) MMRAs and revert to the default behavior/codegen heuristics without affecting correctness.
Consequences of the Absence of happens-before¶
In the happens-before section, we defined how a happens-before relation between two instructions can be broken by leveraging compatibility between MMRAs. When the instructions are incompatible and there is no happens-before relation, we say that the instructions “do not have to be ordered relative to each other”.
“Ordering” in this context is a very broad term which covers both static and runtime aspects.
When there is no ordering constraint, we could statically reorder the instructions in an optimizer transform if the reordering does not break other constraints such as single location coherence. Static reordering is one consequence of breaking happens-before, but it is not the most interesting one.
Run-time consequences are more interesting. When there is a happens-before relation between instructions, the target has to emit synchronization code to ensure other threads will observe the effects of the instructions in the right order.
For instance, the target may have to wait for previous loads & stores to finish before starting a fence-release, or there may be a need to flush a memory cache before executing the next instruction. In the absence of happens-before, there is no such requirement and no waiting or flushing is required. This may noticeably speed up execution in some cases.
Combining Operations¶
If a pass can combine multiple memory or synchronizing operations into one, it needs to be able to combine MMRAs. One possible way to achieve this is by doing a prefix-wise union of the tag sets.
Let A and B be two tag sets, and U be the prefix-wise union of A and B. For every unique tag prefix P present in A or B:
- If either A or B has no tags with prefix P, no tags with prefix P are added to U.
- If both A and B have at least one tag with prefix P, all tags with prefix P from both sets are added to U.
Passes should avoid aggressively combining MMRAs, as this can result in significant losses of information. While this cannot affect correctness, it may affect performance.
As a general rule of thumb, common passes such as SimplifyCFG that aggressively combine/reorder operations should only combine instructions that have identical sets of tags. Passes that combine less frequently, or that are well aware of the cost of combining the MMRAs can use the prefix-wise union described above.
Examples:
A: store release %ptr1 # foo:x, foo:y, bar:x
B: store release %ptr2 # foo:x, bar:y
# Unique prefixes P = [foo, bar]
# "foo:x" is common to A and B so it's added to U.
# "bar:x" != "bar:y" so it's not added to U.
U: store release %ptr3 # foo:x
A: store release %ptr1 # foo:x, foo:y
B: store release %ptr2 # foo:x, bux:y
# Unique prefixes P = [foo, bux]
# "foo:x" is common to A and B so it's added to U.
# No tags have the prefix "bux" in A.
U: store release %ptr3 # foo:x
A: store release %ptr1
B: store release %ptr2 # foo:x, bar:y
# Unique prefixes P = [foo, bar]
# No tags with "foo" or "bar" in A, so no tags added.
U: store release %ptr3
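As a sketch, the first example above could be written in the functional !mmra form as follows (the function and pointer names are illustrative); the combined store U carries only the tag set computed in that example:
define void @combined(ptr %ptr1, ptr %ptr2, ptr %ptr3) {
  store atomic i32 0, ptr %ptr1 release, align 4, !mmra !4   ; A: foo:x, foo:y, bar:x
  store atomic i32 0, ptr %ptr2 release, align 4, !mmra !5   ; B: foo:x, bar:y
  store atomic i32 0, ptr %ptr3 release, align 4, !mmra !0   ; U: foo:x
  ret void
}

!0 = !{!"foo", !"x"}
!1 = !{!"foo", !"y"}
!2 = !{!"bar", !"x"}
!3 = !{!"bar", !"y"}
!4 = !{!0, !1, !2}
!5 = !{!0, !3}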