mirror of
https://github.com/mozilla/gecko-dev.git
synced 2024-12-11 16:32:59 +00:00
42c17aec0b
Differential Revision: https://phabricator.services.mozilla.com/D45818 --HG-- extra : moz-landing-system : lando
371 lines
16 KiB
ReStructuredText
371 lines
16 KiB
ReStructuredText
Advanced Layers
|
|
===============
|
|
|
|
Advanced Layers is a new method of compositing layers in Gecko. This
|
|
document serves as a technical overview and provides a short
|
|
walk-through of its source code.
|
|
|
|
Overview
|
|
--------
|
|
|
|
Advanced Layers attempts to group as many GPU operations as it can into
|
|
a single draw call. This is a common technique in GPU-based rendering
|
|
called “batching”. It is not always trivial, as a batching algorithm can
|
|
easily waste precious CPU resources trying to build optimal draw calls.
|
|
|
|
Advanced Layers reuses the existing Gecko layers system as much as
|
|
possible. Huge layer trees do not currently scale well (see the future
|
|
work section), so opportunities for batching are currently limited
|
|
without expending unnecessary resources elsewhere. However, Advanced
|
|
Layers has a few benefits:
|
|
|
|
- It submits smaller GPU workloads and buffer uploads than the existing
|
|
compositor.
|
|
- It needs only a single pass over the layer tree.
|
|
- It uses occlusion information more intelligently.
|
|
- It is easier to add new specialized rendering paths and new layer
|
|
types.
|
|
- It separates compositing logic from device logic, unlike the existing
|
|
compositor.
|
|
- It is much faster at rendering 3d scenes or complex layer trees.
|
|
- It has experimental code to use the z-buffer for occlusion culling.
|
|
|
|
Because of these benefits we hope that it provides a significant
|
|
improvement over the existing compositor.
|
|
|
|
Advanced Layers uses the acronym “MLG” and “MLGPU” in many places. This
|
|
stands for “Mid-Level Graphics”, the idea being that it is optimized for
|
|
Direct3D 11-style rendering systems as opposed to Direct3D 12 or Vulkan.
|
|
|
|
LayerManagerMLGPU
|
|
-----------------
|
|
|
|
Advanced layers does not change client-side rendering at all. Content
|
|
still uses Direct2D (when possible), and creates identical layer trees
|
|
as it would with a normal Direct3D 11 compositor. In fact, Advanced
|
|
Layers re-uses all of the existing texture handling and video
|
|
infrastructure as well, replacing only the composite-side layer types.
|
|
|
|
Advanced Layers does not create a ``LayerManagerComposite`` - instead,
|
|
it creates a ``LayerManagerMLGPU``. This layer manager does not have a
|
|
``Compositor`` - instead, it has an ``MLGDevice``, which roughly
|
|
abstracts the Direct3D 11 API. (The hope is that this API is easily
|
|
interchangeable for something else when cross-platform or software
|
|
support is needed.)
|
|
|
|
``LayerManagerMLGPU`` also dispenses with the old “composite” layers for
|
|
new layer types. For example, ``ColorLayerComposite`` becomes
|
|
``ColorLayerMLGPU``. Since these layer types implement ``HostLayer``,
|
|
they integrate with ``LayerTransactionParent`` as normal composite
|
|
layers would.
|
|
|
|
Rendering Overview
|
|
------------------
|
|
|
|
The steps for rendering are described in more detail below, but roughly
|
|
the process is:
|
|
|
|
1. Sort layers front-to-back.
|
|
2. Create a dependency tree of render targets (called “views”).
|
|
3. Accumulate draw calls for all layers in each view.
|
|
4. Upload draw call buffers to the GPU.
|
|
5. Execute draw commands for each view.
|
|
|
|
Advanced Layers divides the layer tree into “views”
|
|
(``RenderViewMLGPU``), which correspond to a render target. The root
|
|
layer is represented by a view corresponding to the screen. Layers that
|
|
require intermediate surfaces have temporary views. Layers are analyzed
|
|
front-to-back, and rendered back-to-front within a view. Views
|
|
themselves are rendered front-to-back, to minimize render target
|
|
switching.
|
|
|
|
Each view contains one or more rendering passes (``RenderPassMLGPU``). A
|
|
pass represents a single draw command with one or more rendering items
|
|
attached to it. For example, a ``SolidColorPass`` item contains a
|
|
rectangle and an RGBA value, and many of these can be drawn with a
|
|
single GPU call.
|
|
|
|
When considering a layer, views will first try to find an existing
|
|
rendering batch that can support it. If so, that pass will accumulate
|
|
another draw item for the layer. Otherwise, a new pass will be added.
|
|
|
|
When trying to find a matching pass for a layer, there is a tradeoff in
|
|
CPU time versus the GPU time saved by not issuing another draw commands.
|
|
We generally care more about CPU time, so we do not try too hard in
|
|
matching items to an existing batch.
|
|
|
|
After all layers have been processed, there is a “prepare” step. This
|
|
copies all accumulated draw data and uploads it into vertex and constant
|
|
buffers in the GPU.
|
|
|
|
Finally, we execute rendering commands. At the end of the frame, all
|
|
batches and (most) constant buffers are thrown away.
|
|
|
|
Shaders Overview
|
|
----------------
|
|
|
|
Advanced Layers currently has five layer-related shader pipelines:
|
|
|
|
- Textured (PaintedLayer, ImageLayer, CanvasLayer)
|
|
- ComponentAlpha (PaintedLayer with component-alpha)
|
|
- YCbCr (ImageLayer with YCbCr video)
|
|
- Color (ColorLayers)
|
|
- Blend (ContainerLayers with mix-blend modes)
|
|
|
|
There are also three special shader pipelines:
|
|
|
|
- MaskCombiner, which is used to combine mask layers into a single
|
|
texture.
|
|
- Clear, which is used for fast region-based clears when not directly
|
|
supported by the GPU.
|
|
- Diagnostic, which is used to display the diagnostic overlay texture.
|
|
|
|
The layer shaders follow a unified structure. Each pipeline has a vertex
|
|
and pixel shader. The vertex shader takes a layers ID, a z-buffer depth,
|
|
a unit position in either a unit square or unit triangle, and either
|
|
rectangular or triangular geometry. Shaders can also have ancillary data
|
|
needed like texture coordinates or colors.
|
|
|
|
Most of the time, layers have simple rectangular clips with simple
|
|
rectilinear transforms, and pixel shaders do not need to perform masking
|
|
or clipping. For these layers we use a fast-path pipeline, using
|
|
unit-quad shaders that are able to clip geometry so the pixel shader
|
|
does not have to. This type of pipeline does not support complex masks.
|
|
|
|
If a layer has a complex mask, a rotation or 3d transform, or a complex
|
|
operation like blending, then we use shaders capable of handling
|
|
arbitrary geometry. Their input is a unit triangle, and these shaders
|
|
are generally more expensive.
|
|
|
|
All of the shader-specific data is modelled in ShaderDefinitionsMLGPU.h.
|
|
|
|
CPU Occlusion Culling
|
|
---------------------
|
|
|
|
By default, Advanced Layers performs occlusion culling on the CPU. Since
|
|
layers are visited front-to-back, this is simply a matter of
|
|
accumulating the visible region of opaque layers, and subtracting it
|
|
from the visible region of subsequent layers. There is a major
|
|
difference between this occlusion culling and PostProcessLayers of the
|
|
old compositor: AL performs culling after invalidation, not before.
|
|
Completely valid layers will have an empty visible region.
|
|
|
|
Most layer types (with the exception of images) will intelligently split
|
|
their draw calls into a batch of individual rectangles, based on their
|
|
visible region.
|
|
|
|
Z-Buffering and Occlusion
|
|
-------------------------
|
|
|
|
Advanced Layers also supports occlusion culling on the GPU, using a
|
|
z-buffer. This is disabled by default currently since it is
|
|
significantly costly on integrated GPUs. When using the z-buffer, we
|
|
separate opaque layers into a separate list of passes. The render
|
|
process then uses the following steps:
|
|
|
|
1. The depth buffer is set to read-write.
|
|
2. Opaque batches are executed.,
|
|
3. The depth buffer is set to read-only.
|
|
4. Transparent batches are executed.
|
|
|
|
The problem we have observed is that the depth buffer increases writes
|
|
to the GPU, and on integrated GPUs this is expensive - we have seen draw
|
|
call times increase by 20-30%, which is the wrong direction we want to
|
|
take on battery life. In particular on a full screen video, the call to
|
|
ClearDepthStencilView plus the actual depth buffer write of the video
|
|
can double GPU time.
|
|
|
|
For now the depth-buffer is disabled until we can find a compelling case
|
|
for it on non-integrated hardware.
|
|
|
|
Clipping
|
|
--------
|
|
|
|
Clipping is a bit tricky in Advanced Layers. We cannot use the hardware
|
|
“scissor” feature, since the clip can change from instance to instance
|
|
within a batch. And if using the depth buffer, we cannot write
|
|
transparent pixels for the clipped area. As a result we always clip
|
|
opaque draw rects in the vertex shader (and sometimes even on the CPU,
|
|
as is needed for sane texture coordinates). Only transparent items are
|
|
clipped in the pixel shader. As a result, masked layers and layers with
|
|
non-rectangular transforms are always considered transparent, and use a
|
|
more flexible clipping pipeline.
|
|
|
|
Plane Splitting
|
|
---------------
|
|
|
|
Plane splitting is when a 3D transform causes a layer to be split - for
|
|
example, one transparent layer may intersect another on a separate
|
|
plane. When this happens, Gecko sorts layers using a BSP tree and
|
|
produces a list of triangles instead of draw rects.
|
|
|
|
These layers cannot use the “unit quad” shaders that support the fast
|
|
clipping pipeline. Instead they always use the full triangle-list
|
|
shaders that support extended vertices and clipping.
|
|
|
|
This is the slowest path we can take when building a draw call, since we
|
|
must interact with the polygon clipping and texturing code.
|
|
|
|
Masks
|
|
-----
|
|
|
|
For each layer with a mask attached, Advanced Layers builds a
|
|
``MaskOperation``. These operations must resolve to a single mask
|
|
texture, as well as a rectangular area to which the mask applies. All
|
|
batched pixel shaders will automatically clip pixels to the mask if a
|
|
mask texture is bound. (Note that we must use separate batches if the
|
|
mask texture changes.)
|
|
|
|
Some layers have multiple mask textures. In this case, the MaskOperation
|
|
will store the list of masks, and right before rendering, it will invoke
|
|
a shader to combine these masks into a single texture.
|
|
|
|
MaskOperations are shared across layers when possible, but are not
|
|
cached across frames.
|
|
|
|
BigImage Support
|
|
----------------
|
|
|
|
ImageLayers and CanvasLayers can be tiled with many individual textures.
|
|
This happens in rare cases where the underlying buffer is too big for
|
|
the GPU. Early on this caused problems for Advanced Layers, since AL
|
|
required one texture per layer. We implemented BigImage support by
|
|
creating temporary ImageLayers for each visible tile, and throwing those
|
|
layers away at the end of the frame.
|
|
|
|
Advanced Layers no longer has a 1:1 layer:texture restriction, but we
|
|
retain the temporary layer solution anyway. It is not much code and it
|
|
means we do not have to split ``TexturedLayerMLGPU`` methods into
|
|
iterated and non-iterated versions.
|
|
|
|
Texture Locking
|
|
---------------
|
|
|
|
Advanced Layers has a different texture locking scheme than the existing
|
|
compositor. If a texture needs to be locked, then it is locked by the
|
|
MLGDevice automatically when bound to the current pipeline. The
|
|
MLGDevice keeps a set of the locked textures to avoid double-locking. At
|
|
the end of the frame, any textures in the locked set are unlocked.
|
|
|
|
We cannot easily replicate the locking scheme in the old compositor,
|
|
since the duration of using the texture is not scoped to when we visit
|
|
the layer.
|
|
|
|
Buffer Measurements
|
|
-------------------
|
|
|
|
Advanced Layers uses constant buffers to send layer information and
|
|
extended instance data to the GPU. We do this by pre-allocating large
|
|
constant buffers and mapping them with ``MAP_DISCARD`` at the beginning
|
|
of the frame. Batches may allocate into this up to the maximum bindable
|
|
constant buffer size of the device (currently, 64KB).
|
|
|
|
There are some downsides to this approach. Constant buffers are
|
|
difficult to work with - they have specific alignment requirements, and
|
|
care must be taken not too run over the maximum number of constants in a
|
|
buffer. Another approach would be to store constants in a 2D texture and
|
|
use vertex shader texture fetches. Advanced Layers implemented this and
|
|
benchmarked it to decide which approach to use. Textures seemed to skew
|
|
better on GPU performance, but worse on CPU, but this varied depending
|
|
on the GPU. Overall constant buffers performed best and most
|
|
consistently, so we have kept them.
|
|
|
|
Additionally, we tested different ways of performing buffer uploads.
|
|
Buffer creation itself is costly, especially on integrated GPUs, and
|
|
especially so for immutable, immediate-upload buffers. As a result we
|
|
aggressively cache buffer objects and always allocate them as
|
|
MAP_DISCARD unless they are write-once and long-lived.
|
|
|
|
Buffer Types
|
|
------------
|
|
|
|
Advanced Layers has a few different classes to help build and upload
|
|
buffers to the GPU. They are:
|
|
|
|
- ``MLGBuffer``. This is the low-level shader resource that
|
|
``MLGDevice`` exposes. It is the building block for buffer helper
|
|
classes, but it can also be used to make one-off, immutable,
|
|
immediate-upload buffers. MLGBuffers, being a GPU resource, are
|
|
reference counted.
|
|
- ``SharedBufferMLGPU``. These are large, pre-allocated buffers that
|
|
are read-only on the GPU and write-only on the CPU. They usually
|
|
exceed the maximum bindable buffer size. There are three shared
|
|
buffers created by default and they are automatically unmapped as
|
|
needed: one for vertices, one for vertex shader constants, and one
|
|
for pixel shader constants. When callers allocate into a shared
|
|
buffer they get back a mapped pointer, a GPU resource, and an offset.
|
|
When the underlying device supports offsetable buffers (like
|
|
``ID3D11DeviceContext1`` does), this results in better GPU
|
|
utilization, as there are less resources and fewer upload commands.
|
|
- ``ConstantBufferSection`` and ``VertexBufferSection``. These are
|
|
“views” into a ``SharedBufferMLGPU``. They contain the underlying
|
|
``MLGBuffer``, and when offsetting is supported, the offset
|
|
information necessary for resource binding. Sections are not
|
|
reference counted.
|
|
- ``StagingBuffer``. A dynamically sized CPU buffer where items can be
|
|
appended in a free-form manner. The stride of a single “item” is
|
|
computed by the first item written, and successive items must have
|
|
the same stride. The buffer must be uploaded to the GPU manually.
|
|
Staging buffers are appropriate for creating general constant or
|
|
vertex buffer data. They can also write items in reverse, which is
|
|
how we render back-to-front when layers are visited front-to-back.
|
|
They can be uploaded to a ``SharedBufferMLGPU`` or an immutabler
|
|
``MLGBuffer`` very easily. Staging buffers are not reference counted.
|
|
|
|
Unsupported Features
|
|
--------------------
|
|
|
|
Currently, these features of the old compositor are not yet implemented.
|
|
|
|
- OpenGL and software support (currently AL only works on D3D11).
|
|
- APZ displayport overlay.
|
|
- Diagnostic/developer overlays other than the FPS/timing overlay.
|
|
- DEAA. It was never ported to the D3D11 compositor, but we would like
|
|
it.
|
|
- Component alpha when used inside an opaque intermediate surface.
|
|
- Effects prefs. Possibly not needed post-B2G removal.
|
|
- Widget overlays and underlays used by macOS and Android.
|
|
- DefaultClearColor. This is Android specific, but is easy to added
|
|
when needed.
|
|
- Frame uniformity info in the profiler. Possibly not needed post-B2G
|
|
removal.
|
|
- LayerScope. There are no plans to make this work.
|
|
|
|
Future Work
|
|
-----------
|
|
|
|
- Refactor for D3D12/Vulkan support (namely, split MLGDevice into
|
|
something less stateful and something else more low-level).
|
|
- Remove “MLG” moniker and namespace everything.
|
|
- Other backends (D3D12/Vulkan, OpenGL, Software)
|
|
- Delete CompositorD3D11
|
|
- Add DEAA support
|
|
- Re-enable the depth buffer by default for fast GPUs
|
|
- Re-enable right-sizing of inaccurately sized containers
|
|
- Drop constant buffers for ancillary vertex data
|
|
- Fast shader paths for simple video/painted layer cases
|
|
|
|
History
|
|
-------
|
|
|
|
Advanced Layers has gone through four major design iterations. The
|
|
initial version used tiling - each render view divided the screen into
|
|
128x128 tiles, and layers were assigned to tiles based on their
|
|
screen-space draw area. This approach proved not to scale well to 3d
|
|
transforms, and so tiling was eliminated.
|
|
|
|
We replaced it with a simple system of accumulating draw regions to each
|
|
batch, thus ensuring that items could be assigned to batches while
|
|
maintaining correct z-ordering. This second iteration also coincided
|
|
with plane-splitting support.
|
|
|
|
On large layer trees, accumulating the affected regions of batches
|
|
proved to be quite expensive. This led to a third iteration, using depth
|
|
buffers and separate opaque and transparent batch lists to achieve
|
|
z-ordering and occlusion culling.
|
|
|
|
Finally, depth buffers proved to be too expensive, and we introduced a
|
|
simple CPU-based occlusion culling pass. This iteration coincided with
|
|
using more precise draw rects and splitting pipelines into unit-quad,
|
|
cpu-clipped and triangle-list, gpu-clipped variants.
|