diff --git a/gfx/doc/AdvancedLayers.md b/gfx/doc/AdvancedLayers.md deleted file mode 100644 index 6bfb7c50dbbb..000000000000 --- a/gfx/doc/AdvancedLayers.md +++ /dev/null @@ -1,308 +0,0 @@ -Advanced Layers -============== - -Advanced Layers is a new method of compositing layers in Gecko. This document serves as a technical -overview and provides a short walk-through of its source code. - -Overview -------------- - -Advanced Layers attempts to group as many GPU operations as it can into a single draw call. This is -a common technique in GPU-based rendering called "batching". It is not always trivial, as a -batching algorithm can easily waste precious CPU resources trying to build optimal draw calls. - -Advanced Layers reuses the existing Gecko layers system as much as possible. Huge layer trees do -not currently scale well (see the future work section), so opportunities for batching are currently -limited without expending unnecessary resources elsewhere. However, Advanced Layers has a few -benefits: - - * It submits smaller GPU workloads and buffer uploads than the existing compositor. - * It needs only a single pass over the layer tree. - * It uses occlusion information more intelligently. - * It is easier to add new specialized rendering paths and new layer types. - * It separates compositing logic from device logic, unlike the existing compositor. - * It is much faster at rendering 3d scenes or complex layer trees. - * It has experimental code to use the z-buffer for occlusion culling. - -Because of these benefits we hope that it provides a significant improvement over the existing -compositor. - -Advanced Layers uses the acronym "MLG" and "MLGPU" in many places. This stands for "Mid-Level -Graphics", the idea being that it is optimized for Direct3D 11-style rendering systems as opposed -to Direct3D 12 or Vulkan. - -LayerManagerMLGPU ------------------------------- - -Advanced layers does not change client-side rendering at all. 
Content still uses Direct2D (when -possible), and creates identical layer trees as it would with a normal Direct3D 11 compositor. In -fact, Advanced Layers re-uses all of the existing texture handling and video infrastructure as -well, replacing only the composite-side layer types. - -Advanced Layers does not create a `LayerManagerComposite` - instead, it creates a -`LayerManagerMLGPU`. This layer manager does not have a `Compositor` - instead, it has an -`MLGDevice`, which roughly abstracts the Direct3D 11 API. (The hope is that this API is easily -interchangeable for something else when cross-platform or software support is needed.) - -`LayerManagerMLGPU` also dispenses with the old "composite" layers for new layer types. For -example, `ColorLayerComposite` becomes `ColorLayerMLGPU`. Since these layer types implement -`HostLayer`, they integrate with `LayerTransactionParent` as normal composite layers would. - -Rendering Overview ----------------------------- - -The steps for rendering are described in more detail below, but roughly the process is: - -1. Sort layers front-to-back. -2. Create a dependency tree of render targets (called "views"). -3. Accumulate draw calls for all layers in each view. -4. Upload draw call buffers to the GPU. -5. Execute draw commands for each view. - -Advanced Layers divides the layer tree into "views" (`RenderViewMLGPU`), which correspond to a -render target. The root layer is represented by a view corresponding to the screen. Layers that -require intermediate surfaces have temporary views. Layers are analyzed front-to-back, and rendered -back-to-front within a view. Views themselves are rendered front-to-back, to minimize render target -switching. - -Each view contains one or more rendering passes (`RenderPassMLGPU`). A pass represents a single -draw command with one or more rendering items attached to it. 
For example, a `SolidColorPass` item -contains a rectangle and an RGBA value, and many of these can be drawn with a single GPU call. - -When considering a layer, views will first try to find an existing rendering batch that can support -it. If so, that pass will accumulate another draw item for the layer. Otherwise, a new pass will be -added. - -When trying to find a matching pass for a layer, there is a tradeoff between CPU time and the GPU -time saved by not issuing another draw command. We generally care more about CPU time, so we do -not try too hard to match items to an existing batch. - -After all layers have been processed, there is a "prepare" step. This copies all accumulated draw -data and uploads it into vertex and constant buffers in the GPU. - -Finally, we execute rendering commands. At the end of the frame, all batches and (most) constant -buffers are thrown away. - -Shaders Overview -------------------------------------- - -Advanced Layers currently has five layer-related shader pipelines: - - - Textured (PaintedLayer, ImageLayer, CanvasLayer) - - ComponentAlpha (PaintedLayer with component-alpha) - - YCbCr (ImageLayer with YCbCr video) - - Color (ColorLayers) - - Blend (ContainerLayers with mix-blend modes) - -There are also three special shader pipelines: - - - MaskCombiner, which is used to combine mask layers into a single texture. - - Clear, which is used for fast region-based clears when not directly supported by the GPU. - - Diagnostic, which is used to display the diagnostic overlay texture. - -The layer shaders follow a unified structure. Each pipeline has a vertex and pixel shader. -The vertex shader takes a layers ID, a z-buffer depth, a unit position in either a unit -square or unit triangle, and either rectangular or triangular geometry. Shaders can also -take ancillary data such as texture coordinates or colors. 
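
As a rough illustration of the vertex inputs described above, the sketch below shows one common way a vertex shader can expand a unit position into real geometry: a unit-square corner is scaled into a rectangle, and a unit-triangle corner is mapped onto one of three triangle vertices. The function names and tuple layouts are hypothetical and written in Python for readability; the actual MLGPU shaders may compute this differently.

```python
def expand_unit_quad(unit_pos, rect):
    """Map a unit-square position (u, v in [0, 1]) onto a rectangle.

    `rect` is (x, y, width, height); this layout is an assumption made for
    the example, not the real MLGPU instance format.
    """
    u, v = unit_pos
    x, y, width, height = rect
    return (x + u * width, y + v * height)


def expand_unit_triangle(unit_pos, triangle):
    """Map a unit-triangle position onto an arbitrary triangle.

    The unit triangle has corners (0, 0), (1, 0) and (0, 1); each corner
    lands exactly on one of the three vertices in `triangle`.
    """
    u, v = unit_pos
    (x0, y0), (x1, y1), (x2, y2) = triangle
    return (x0 + u * (x1 - x0) + v * (x2 - x0),
            y0 + u * (y1 - y0) + v * (y2 - y0))


# The four corners of the unit square cover the whole draw rect.
corners = [expand_unit_quad(p, (10, 20, 100, 50))
           for p in [(0, 0), (1, 0), (0, 1), (1, 1)]]
```

Because only the rectangle or triangle changes per instance, many items can share the same unit geometry within one draw call.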
- -Most of the time, layers have simple rectangular clips with simple rectilinear transforms, and -pixel shaders do not need to perform masking or clipping. For these layers we use a fast-path -pipeline, using unit-quad shaders that are able to clip geometry so the pixel shader does not -have to. This type of pipeline does not support complex masks. - -If a layer has a complex mask, a rotation or 3d transform, or a complex operation like blending, -then we use shaders capable of handling arbitrary geometry. Their input is a unit triangle, -and these shaders are generally more expensive. - -All of the shader-specific data is modelled in ShaderDefinitionsMLGPU.h. - -CPU Occlusion Culling -------------------------------------- - -By default, Advanced Layers performs occlusion culling on the CPU. Since layers are visited -front-to-back, this is simply a matter of accumulating the visible region of opaque layers, and -subtracting it from the visible region of subsequent layers. There is a major difference -between this occlusion culling and PostProcessLayers of the old compositor: AL performs culling -after invalidation, not before. Completely valid layers will have an empty visible region. - -Most layer types (with the exception of images) will intelligently split their draw calls into a -batch of individual rectangles, based on their visible region. - -Z-Buffering and Occlusion -------------------------------------- - -Advanced Layers also supports occlusion culling on the GPU, using a z-buffer. This is currently -disabled by default because it is costly on integrated GPUs. When using the z-buffer, we -put opaque layers into a separate list of passes. The render process then uses the following -steps: - - 1. The depth buffer is set to read-write. - 2. Opaque batches are executed. - 3. The depth buffer is set to read-only. - 4. Transparent batches are executed. 
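
The four steps above can be sketched with a tiny software depth buffer. This is only an illustration under simplifying assumptions: each item covers a set of pixel coordinates, a smaller depth value means closer to the viewer, opaque items are drawn nearest-first, and the names (`Item`, `render`) are hypothetical rather than part of the MLGPU code.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    depth: int          # smaller = closer to the viewer (assumption)
    pixels: frozenset   # set of (x, y) coordinates the item covers
    opaque: bool


def render(items):
    """Return the list of (item name, pixel) writes that pass the depth test."""
    depth_buffer = {}   # pixel -> depth of the nearest opaque item drawn so far
    writes = []

    # Steps 1-2: depth buffer is read-write; opaque items go nearest-first so
    # that pixels hidden behind nearer opaque items are rejected.
    for item in sorted((i for i in items if i.opaque), key=lambda i: i.depth):
        for pixel in sorted(item.pixels):
            if item.depth < depth_buffer.get(pixel, float("inf")):
                depth_buffer[pixel] = item.depth
                writes.append((item.name, pixel))

    # Steps 3-4: depth buffer is read-only; transparent items go back-to-front
    # for blending and are tested against the buffer but never write to it.
    transparent = [i for i in items if not i.opaque]
    for item in sorted(transparent, key=lambda i: i.depth, reverse=True):
        for pixel in sorted(item.pixels):
            if item.depth < depth_buffer.get(pixel, float("inf")):
                writes.append((item.name, pixel))

    return writes


scene = [
    Item("back", 3, frozenset({(0, 0), (1, 0)}), opaque=True),
    Item("front", 1, frozenset({(0, 0)}), opaque=True),
    Item("glass", 2, frozenset({(0, 0), (1, 0)}), opaque=False),
]
writes = render(scene)
```

In this scene the pixel at (0, 0) is written only by the nearest opaque item, while (1, 0) receives the opaque background and then the transparent item in front of it.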
- -The problem we have observed is that the depth buffer increases writes to the GPU, and on -integrated GPUs this is expensive - we have seen draw call times increase by 20-30%, which is the -wrong direction for battery life. In particular on a full screen video, the call to -ClearDepthStencilView plus the actual depth buffer write of the video can double GPU time. - -For now the depth buffer is disabled until we can find a compelling case for it on non-integrated -hardware. - -Clipping -------------------------------------- - -Clipping is a bit tricky in Advanced Layers. We cannot use the hardware "scissor" feature, since the -clip can change from instance to instance within a batch. And if using the depth buffer, we -cannot write transparent pixels for the clipped area. As a result we always clip opaque draw rects -in the vertex shader (and sometimes even on the CPU, as is needed for sane texture coordinates). -Only transparent items are clipped in the pixel shader. As a result, masked layers and layers with -non-rectangular transforms are always considered transparent, and use a more flexible clipping -pipeline. - -Plane Splitting ---------------------- - -Plane splitting is when a 3D transform causes a layer to be split - for example, one transparent -layer may intersect another on a separate plane. When this happens, Gecko sorts layers using a BSP -tree and produces a list of triangles instead of draw rects. - -These layers cannot use the "unit quad" shaders that support the fast clipping pipeline. Instead -they always use the full triangle-list shaders that support extended vertices and clipping. - -This is the slowest path we can take when building a draw call, since we must interact with the -polygon clipping and texturing code. - -Masks ---------- - -For each layer with a mask attached, Advanced Layers builds a `MaskOperation`. These operations -must resolve to a single mask texture, as well as a rectangular area to which the mask applies. 
All -batched pixel shaders will automatically clip pixels to the mask if a mask texture is bound. (Note -that we must use separate batches if the mask texture changes.) - -Some layers have multiple mask textures. In this case, the MaskOperation will store the list of -masks, and right before rendering, it will invoke a shader to combine these masks into a single texture. - -MaskOperations are shared across layers when possible, but are not cached across frames. - -BigImage Support --------------------------- - -ImageLayers and CanvasLayers can be tiled with many individual textures. This happens in rare cases -where the underlying buffer is too big for the GPU. Early on this caused problems for Advanced -Layers, since AL required one texture per layer. We implemented BigImage support by creating -temporary ImageLayers for each visible tile, and throwing those layers away at the end of the -frame. - -Advanced Layers no longer has a 1:1 layer:texture restriction, but we retain the temporary layer -solution anyway. It is not much code and it means we do not have to split `TexturedLayerMLGPU` -methods into iterated and non-iterated versions. - -Texture Locking ----------------------- - -Advanced Layers has a different texture locking scheme than the existing compositor. If a texture -needs to be locked, then it is locked by the MLGDevice automatically when bound to the current -pipeline. The MLGDevice keeps a set of the locked textures to avoid double-locking. At the end of -the frame, any textures in the locked set are unlocked. - -We cannot easily replicate the locking scheme in the old compositor, since the duration of using -the texture is not scoped to when we visit the layer. - -Buffer Measurements -------------------------------- - -Advanced Layers uses constant buffers to send layer information and extended instance data to the -GPU. We do this by pre-allocating large constant buffers and mapping them with `MAP_DISCARD` at the -beginning of the frame. 
Batches may allocate into this up to the maximum bindable constant buffer -size of the device (currently, 64KB). - -There are some downsides to this approach. Constant buffers are difficult to work with - they have -specific alignment requirements, and care must be taken not to run over the maximum number of -constants in a buffer. Another approach would be to store constants in a 2D texture and use vertex -shader texture fetches. Advanced Layers implemented this and benchmarked it to decide which -approach to use. Textures tended to perform better on the GPU but worse on the CPU, although this -varied depending on the GPU. Overall constant buffers performed best and most consistently, so we -have kept them. - -Additionally, we tested different ways of performing buffer uploads. Buffer creation itself is -costly, especially on integrated GPUs, and especially so for immutable, immediate-upload buffers. -As a result we aggressively cache buffer objects and always allocate them as MAP_DISCARD unless -they are write-once and long-lived. - -Buffer Types ------------- - -Advanced Layers has a few different classes to help build and upload buffers to the GPU. They are: - - - `MLGBuffer`. This is the low-level shader resource that `MLGDevice` exposes. It is the building - block for buffer helper classes, but it can also be used to make one-off, immutable, - immediate-upload buffers. MLGBuffers, being a GPU resource, are reference counted. - - `SharedBufferMLGPU`. These are large, pre-allocated buffers that are read-only on the GPU and - write-only on the CPU. They usually exceed the maximum bindable buffer size. There are three - shared buffers created by default and they are automatically unmapped as needed: one for vertices, - one for vertex shader constants, and one for pixel shader constants. When callers allocate into a - shared buffer they get back a mapped pointer, a GPU resource, and an offset. 
When the underlying - device supports offsetable buffers (like `ID3D11DeviceContext1` does), this results in better GPU - utilization, as there are fewer resources and fewer upload commands. - - `ConstantBufferSection` and `VertexBufferSection`. These are "views" into a `SharedBufferMLGPU`. - They contain the underlying `MLGBuffer`, and when offsetting is supported, the offset - information necessary for resource binding. Sections are not reference counted. - - `StagingBuffer`. A dynamically sized CPU buffer where items can be appended in a free-form - manner. The stride of a single "item" is computed by the first item written, and successive - items must have the same stride. The buffer must be uploaded to the GPU manually. Staging buffers - are appropriate for creating general constant or vertex buffer data. They can also write items in - reverse, which is how we render back-to-front when layers are visited front-to-back. They can be - uploaded to a `SharedBufferMLGPU` or an immutable `MLGBuffer` very easily. Staging buffers are not - reference counted. - -Unsupported Features --------------------------------- - -Currently, these features of the old compositor are not yet implemented. - - - OpenGL and software support (currently AL only works on D3D11). - - APZ displayport overlay. - - Diagnostic/developer overlays other than the FPS/timing overlay. - - DEAA. It was never ported to the D3D11 compositor, but we would like it. - - Component alpha when used inside an opaque intermediate surface. - - Effects prefs. Possibly not needed post-B2G removal. - - Widget overlays and underlays used by macOS and Android. - - DefaultClearColor. This is Android specific, but is easy to add when needed. - - Frame uniformity info in the profiler. Possibly not needed post-B2G removal. - - LayerScope. There are no plans to make this work. 
- -Future Work --------------------------------- - - - Refactor for D3D12/Vulkan support (namely, split MLGDevice into something less stateful and something else more low-level). - - Remove "MLG" moniker and namespace everything. - - Other backends (D3D12/Vulkan, OpenGL, Software) - - Delete CompositorD3D11 - - Add DEAA support - - Re-enable the depth buffer by default for fast GPUs - - Re-enable right-sizing of inaccurately sized containers - - Drop constant buffers for ancillary vertex data - - Fast shader paths for simple video/painted layer cases - -History ----------- - -Advanced Layers has gone through four major design iterations. The initial version used tiling - -each render view divided the screen into 128x128 tiles, and layers were assigned to tiles based on -their screen-space draw area. This approach proved not to scale well to 3d transforms, and so -tiling was eliminated. - -We replaced it with a simple system of accumulating draw regions to each batch, thus ensuring that -items could be assigned to batches while maintaining correct z-ordering. This second iteration also -coincided with plane-splitting support. - -On large layer trees, accumulating the affected regions of batches proved to be quite expensive. -This led to a third iteration, using depth buffers and separate opaque and transparent batch lists -to achieve z-ordering and occlusion culling. - -Finally, depth buffers proved to be too expensive, and we introduced a simple CPU-based occlusion -culling pass. This iteration coincided with using more precise draw rects and splitting pipelines -into unit-quad, cpu-clipped and triangle-list, gpu-clipped variants. - diff --git a/gfx/doc/AsyncPanZoom.md b/gfx/doc/AsyncPanZoom.md deleted file mode 100644 index 1fc58e03d3f4..000000000000 --- a/gfx/doc/AsyncPanZoom.md +++ /dev/null @@ -1,299 +0,0 @@ -Asynchronous Panning and Zooming {#apz} -================================ - -**This document is a work in progress. 
Some information may be missing or incomplete.** - -## Goals - -We need to be able to provide a visual response to user input with minimal latency. -In particular, on devices with touch input, content must track the finger exactly while panning, or the user experience is very poor. -According to the UX team, 120ms is an acceptable latency between user input and response. - -## Context and surrounding architecture - -The fundamental problem we are trying to solve with the Asynchronous Panning and Zooming (APZ) code is that of responsiveness. -By default, web browsers operate in a "game loop" that looks like this: - - while true: - process input - do computations - repaint content - display repainted content - -In browsers the "do computation" step can be arbitrarily expensive because it can involve running event handlers in web content. -Therefore, there can be an arbitrary delay between the input being received and the on-screen display getting updated. - -Responsiveness is always good, and with touch-based interaction it is even more important than with mouse or keyboard input. -In order to ensure responsiveness, we split the "game loop" model of the browser into a multithreaded variant which looks something like this: - - Thread 1 (compositor thread) - while true: - receive input - send a copy of input to thread 2 - adjust painted content based on input - display adjusted painted content - - Thread 2 (main thread) - while true: - receive input from thread 1 - do computations - repaint content - update the copy of painted content in thread 1 - -This multithreaded model is called off-main-thread compositing (OMTC), because the compositing (where the content is displayed on-screen) happens on a separate thread from the main thread. -Note that this is a very very simplified model, but in this model the "adjust painted content based on input" is the primary function of the APZ code. 
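
To make the "adjust painted content based on input" step a little more concrete, here is a minimal single-threaded sketch of the state the compositor side might keep. All names (`CompositorView`, `on_pan`, `on_repaint`) are hypothetical, and the real APZ code also handles zoom, velocity, and cross-thread messaging, none of which is modelled here.

```python
from dataclasses import dataclass


@dataclass
class CompositorView:
    painted_scroll_y: float        # scroll offset the main thread last painted at
    async_scroll_y: float = 0.0    # extra scroll known only to the compositor

    def on_pan(self, delta_y: float) -> None:
        """Input arrived: adjust immediately, without waiting for a repaint."""
        self.async_scroll_y += delta_y

    def on_repaint(self, new_painted_scroll_y: float) -> None:
        """The main thread delivered freshly painted content."""
        caught_up = new_painted_scroll_y - self.painted_scroll_y
        self.painted_scroll_y = new_painted_scroll_y
        # Only the part of the pan that is not yet painted stays async.
        self.async_scroll_y -= caught_up

    def content_translation_y(self) -> float:
        """Vertical translation applied to the painted content when compositing."""
        return -self.async_scroll_y


view = CompositorView(painted_scroll_y=200.0)
view.on_pan(10.0)                 # user drags; content shifts by 10 at once
shift_before = view.content_translation_y()
view.on_repaint(210.0)            # main thread repaints at the new offset
shift_after = view.content_translation_y()
```

The point of the sketch is that the pan takes effect on the compositor immediately, and the asynchronous offset shrinks back to zero once the main thread's repaint catches up.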
- -The "painted content" is stored on a set of "layers", that are conceptually double-buffered. -That is, when the main thread does its repaint, it paints into one set of layers (the "client" layers). -The update that is sent to the compositor thread copies all the changes from the client layers into another set of layers that the compositor holds. -These layers are called the "shadow" layers or the "compositor" layers. -The compositor in theory can continuously composite these shadow layers to the screen while the main thread is busy doing other things and painting a new set of client layers. - -The APZ code takes the input events that are coming in from the hardware and uses them to figure out what the user is trying to do (e.g. pan the page, zoom in). -It then expresses this user intention in the form of translation and/or scale transformation matrices. -These transformation matrices are applied to the shadow layers at composite time, so that what the user sees on-screen reflects what they are trying to do as closely as possible. - -## Technical overview - -As per the heavily simplified model described above, the fundamental purpose of the APZ code is to take input events and produce transformation matrices. -This section attempts to break that down and identify the different problems that make this task non-trivial. - -### Checkerboarding - -The content area that is painted and stored in a shadow layer is called the "displayport". -The APZ code is responsible for determining how large the displayport should be. -On the one hand, we want the displayport to be as large as possible. -At the very least it needs to be larger than what is visible on-screen, because otherwise, as soon as the user pans, there will be some unpainted area of the page exposed. -However, we cannot always set the displayport to be the entire page, because the page can be arbitrarily long and this would require an unbounded amount of memory to store. 
-Therefore, a good displayport size is one that is larger than the visible area but not so large that it is a huge drain on memory. -Because the displayport is usually smaller than the whole page, it is always possible for the user to scroll so fast that they end up in an area of the page outside the displayport. -When this happens, they see unpainted content; this is referred to as "checkerboarding", and we try to avoid it where possible. - -There are many possible ways to determine what the displayport should be in order to balance the tradeoffs involved (i.e. having one that is too big is bad for memory usage, and having one that is too small results in excessive checkerboarding). -Ideally, the displayport should cover exactly the area that we know the user will make visible. -Although we cannot know this for sure, we can use heuristics based on current panning velocity and direction to ensure a reasonably-chosen displayport area. -This calculation is done in the APZ code, and a new desired displayport is frequently sent to the main thread as the user is panning around. - -### Multiple layers - -Consider, for example, a scrollable page that contains an iframe which itself is scrollable. -The iframe can be scrolled independently of the top-level page, and we would like both the page and the iframe to scroll responsively. -This means that we want independent asynchronous panning for both the top-level page and the iframe. -In addition to iframes, elements that have the overflow:scroll CSS property set are also scrollable, and also end up on separate scrollable layers. -In the general case, the layers are arranged in a tree structure, and so within the APZ code we have a matching tree of AsyncPanZoomController (APZC) objects, one for each scrollable layer. -To manage this tree of APZC instances, we have a single APZCTreeManager object. 
-Each APZC is relatively independent and handles the scrolling for its associated layer, but there are some cases in which they need to interact; these cases are described in the sections below. - -### Hit detection - -Consider again the case where we have a scrollable page that contains an iframe which itself is scrollable. -As described above, we will have two APZC instances - one for the page and one for the iframe. -When the user puts their finger down on the screen and moves it, we need to do some sort of hit detection in order to determine whether their finger is on the iframe or on the top-level page. -Based on where their finger lands, the appropriate APZC instance needs to handle the input. -This hit detection is also done in the APZCTreeManager, as it has the necessary information about the sizes and positions of the layers. -Currently this hit detection is not perfect, as it uses rects and does not account for things like rounded corners and opacity. - -Also note that for some types of input (e.g. when the user puts two fingers down to do a pinch) we do not want the input to be "split" across two different APZC instances. -In the case of a pinch, for example, we find a "common ancestor" APZC instance - one that is zoomable and contains all of the touch input points, and direct the input to that APZC instance. - -### Scroll Handoff - -Consider yet again the case where we have a scrollable page that contains an iframe which itself is scrollable. -Say the user scrolls the iframe so that it reaches the bottom. -If the user continues panning on the iframe, the expectation is that the top-level page will start scrolling. -However, as discussed in the section on hit detection, the APZC instance for the iframe is separate from the APZC instance for the top-level page. -Thus, we need the two APZC instances to communicate in some way such that input events on the iframe result in scrolling on the top-level page. 
-This behaviour is referred to as "scroll handoff" (or "fling handoff" in the case where analogous behaviour results from the scrolling momentum of the page after the user has lifted their finger). - -### Input event untransformation - -The APZC architecture by definition results in two copies of a "scroll position" for each scrollable layer. -There is the original copy on the main thread that is accessible to web content and the layout and painting code. -And there is a second copy on the compositor side, which is updated asynchronously based on user input, and corresponds to what the user visually sees on the screen. -Although these two copies may diverge temporarily, they are reconciled periodically. -In particular, they diverge while the APZ code is performing an async pan or zoom action on behalf of the user, and are reconciled when the APZ code requests a repaint from the main thread. - -Because of the way input events are stored, this has some unfortunate consequences. -Input events are stored relative to the device screen - so if the user touches at the same physical spot on the device, the same input events will be delivered regardless of the content scroll position. -When the main thread receives a touch event, it combines that with the content scroll position in order to figure out what DOM element the user touched. -However, because we now have two different scroll positions, this process may not work perfectly. -A concrete example follows: - -Consider a device with screen size 600 pixels tall. -On this device, a user is viewing a document that is 1000 pixels tall, and that is scrolled down by 200 pixels. -That is, the vertical section of the document from 200px to 800px is visible. -Now, if the user touches a point 100px from the top of the physical display, the hardware will generate a touch event with y=100. -This will get sent to the main thread, which will add the scroll position (200) and get a document-relative touch event with y=300. 
-This new y-value will be used in hit detection to figure out what the user touched. -If the document had an absolute-positioned div at y=300, then that would receive the touch event. - -Now let us add some async scrolling to this example. -Say that the user additionally scrolls the document by another 10 pixels asynchronously (i.e. only on the compositor thread), and then does the same touch event. -The same input event is generated by the hardware, and as before, the document will deliver the touch event to the div at y=300. -However, visually, the document is scrolled by an additional 10 pixels so this outcome is wrong. -What needs to happen is that the APZ code needs to intercept the touch event and account for the 10 pixels of asynchronous scroll. -Therefore, the input event with y=100 gets converted to y=110 in the APZ code before being passed on to the main thread. -The main thread then adds the scroll position it knows about and determines that the user touched at a document-relative position of y=310. - -Analogous input event transformations need to be done for horizontal scrolling and zooming. - -### Content independently adjusting scrolling - -As described above, there are two copies of the scroll position in the APZ architecture - one on the main thread and one on the compositor thread. -Usually for architectures like this, there is a single "source of truth" value and the other value is simply a copy. -However, in this case that is not easily possible. -The reason is that both of these values can be legitimately modified. -On the compositor side, the input events the user is triggering modify the scroll position, which is then propagated to the main thread. -However, on the main thread, web content might be running JavaScript code that programmatically sets the scroll position (via window.scrollTo, for example). 
-Scroll changes driven from the main thread are just as legitimate and need to be propagated to the compositor thread, so that the visual display updates in response. - -Because the cross-thread messaging is asynchronous, reconciling the two types of scroll changes is a tricky problem. -Our design solves this using various flags and generation counters. -The general heuristic we have is that content-driven scroll position changes (e.g. scrollTo from JS) are never lost. -For instance, if the user is doing an async scroll with their finger and content does a scrollTo in the middle, then some of the async scroll would occur before the "jump" and the rest after the "jump". - -### Content preventing default behaviour of input events - -Another problem that we need to deal with is that web content is allowed to intercept touch events and prevent the "default behaviour" of scrolling. -This ability is defined in web standards and is non-negotiable. -Touch event listeners in web content are allowed to call preventDefault() on the touchstart or first touchmove event for a touch point; doing this is supposed to "consume" the event and prevent touch-based panning. -As we saw in a previous section, the input event needs to be untransformed by the APZ code before it can be delivered to content. -But, because of the preventDefault problem, we cannot fully process the touch event in the APZ code until content has had a chance to handle it. -Web browsers in general solve this problem by inserting a delay of up to 300ms before processing the input - that is, web content is allowed up to 300ms to process the event and call preventDefault on it. -If web content takes longer than 300ms, or if it completes handling of the event without calling preventDefault, then the browser immediately starts processing the events. - -The way the APZ implementation deals with this is that upon receiving a touch event, it immediately returns an untransformed version that can be dispatched to content. 
-It also schedules a 400ms timeout (600ms on Android) during which content is allowed to prevent scrolling. -There is an API that allows the main-thread event dispatching code to notify the APZ as to whether or not the default action should be prevented. -If the APZ content response timeout expires, or if the main-thread event dispatching code notifies the APZ of the preventDefault status, then the APZ continues with the processing of the events (which may involve discarding the events). - -The touch-action CSS property from the pointer-events spec is intended to allow eliminating this 400ms delay in many cases (although for backwards compatibility it will still be needed for a while). -Note that even with touch-action implemented, there may be cases where the APZ code does not know the touch-action behaviour of the point the user touched. -In such cases, the APZ code will still wait up to 400ms for the main thread to provide it with the touch-action behaviour information. - -## Technical details - -This section describes various pieces of the APZ code, and goes into more specific detail on APIs and code than the previous sections. -The primary purpose of this section is to help people who plan on making changes to the code, while also not going into so much detail that it needs to be updated with every patch. - -### Overall flow of input events - -This section describes how input events flow through the APZ code. -
-1. Input events arrive from the hardware/widget code into the APZ via APZCTreeManager::ReceiveInputEvent.
-   The thread that invokes this is called the input thread, and may or may not be the same as the Gecko main thread.
-2. Conceptually the first thing that the APZCTreeManager does is to associate these events with "input blocks".
-   An input block is a set of events that share certain properties, and generally are intended to represent a single gesture.
-   For example with touch events, all events following a touchstart up to but not including the next touchstart are in the same block.
-   All of the events in a given block will go to the same APZC instance and will either all be processed or all be dropped.
-3. Using the first event in the input block, the APZCTreeManager does a hit-test to see which APZC it hits.
-   This hit-test uses the event regions populated on the layers, which may be larger than the true hit area of the layer.
-   If no APZC is hit, the events are discarded and we jump to step 6.
-   Otherwise, the input block is tagged with the hit APZC as a tentative target and put into a global APZ input queue.
-4. (i) If the input events landed outside the dispatch-to-content event region for the layer, any available events in the input block are processed.
-       These may trigger behaviours like scrolling or tap gestures.
-   (ii) If the input events landed inside the dispatch-to-content event region for the layer, the events are left in the queue and a 400ms timeout is initiated.
-       If the timeout expires before step 9 is completed, the APZ assumes the input block was not cancelled and the tentative target is correct, and processes them as part of step 10.
-5. The call stack unwinds back to APZCTreeManager::ReceiveInputEvent, which does an in-place modification of the input event so that any async transforms are removed.
-6. The call stack unwinds back to the widget code that called ReceiveInputEvent.
-   This code now has the event in the coordinate space Gecko is expecting, and so can dispatch it to the Gecko main thread.
-7. Gecko performs its own usual hit-testing and event dispatching for the event.
-   As part of this, it records whether any touch listeners cancelled the input block by calling preventDefault().
-   It also activates inactive scrollframes that were hit by the input events.
-8. The call stack unwinds back to the widget code, which sends two notifications to the APZ code on the input thread.
-   The first notification is via APZCTreeManager::ContentReceivedInputBlock, and informs the APZ whether the input block was cancelled.
-   The second notification is via APZCTreeManager::SetTargetAPZC, and informs the APZ of the results of the Gecko hit-test during event dispatch.
-   Note that Gecko may report that the input event did not hit any scrollable frame at all.
-   The SetTargetAPZC notification happens only once per input block, while the ContentReceivedInputBlock notification may happen once per block, or multiple times per block, depending on the input type.
-9. (i) If the events were processed as part of step 4(i), the notifications from step 8 are ignored and step 10 is skipped.
-   (ii) If events were queued as part of step 4(ii), and steps 5-8 take less than 400ms, the arrival of both notifications from step 8 will mark the input block ready for processing.
-   (iii) If events were queued as part of step 4(ii), but steps 5-8 take longer than 400ms, the notifications from step 8 will be ignored and step 10 will already have happened.
-10. If events were queued as part of step 4(ii) they are now either processed (if the input block was not cancelled and Gecko detected a scrollframe under the input event, or if the timeout expired) or dropped (all other cases).
-   Note that the APZC that processes the events may be different at this step than the tentative target from step 3, depending on the SetTargetAPZC notification.
-   Processing the events may trigger behaviours like scrolling or tap gestures.
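The decision logic in steps 3, 4, 9 and 10 can be sketched as a small function. This is only an illustrative model, not the APZ implementation: `ContentResponse`, `resolve_input_block`, and the string APZC names are hypothetical, and treating a response that arrives at exactly 400ms as "in time" is an assumption, since the text only distinguishes "less than" and "longer than" 400ms.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Timeout quoted in the text (600ms on Android).
CONTENT_RESPONSE_TIMEOUT_MS = 400


@dataclass
class ContentResponse:
    """Hypothetical bundle of the two notifications from step 8."""
    prevent_default: bool            # reported via ContentReceivedInputBlock
    confirmed_target: Optional[str]  # reported via SetTargetAPZC; None = no scrollframe hit
    arrival_ms: int                  # time since the block was queued


def resolve_input_block(
    tentative_target: Optional[str],
    in_dispatch_to_content: bool,
    response: Optional[ContentResponse],
) -> Tuple[str, Optional[str]]:
    """Return (action, target APZC) for one input block."""
    # Step 3: nothing was hit, so the APZ discards the events.
    if tentative_target is None:
        return ("dropped", None)
    # Step 4(i): outside the dispatch-to-content region, process right away.
    if not in_dispatch_to_content:
        return ("processed", tentative_target)
    # Steps 4(ii) and 9(iii): no response before the timeout, so the APZ
    # assumes the block was not cancelled and keeps the tentative target.
    if response is None or response.arrival_ms > CONTENT_RESPONSE_TIMEOUT_MS:
        return ("processed", tentative_target)
    # Step 10: cancelled, or Gecko found no scrollframe under the event.
    if response.prevent_default or response.confirmed_target is None:
        return ("dropped", None)
    # Step 10: processed, possibly by a different APZC than the tentative one.
    return ("processed", response.confirmed_target)


if __name__ == "__main__":
    late = ContentResponse(prevent_default=True, confirmed_target="apzc-sub", arrival_ms=650)
    print(resolve_input_block("apzc-root", True, late))
```

Note that in the example a late `preventDefault()` has no effect: once the timeout has expired the block has already been processed against the tentative target.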
- -If the CSS touch-action property is enabled, the above steps are modified as follows: - - -#### Threading considerations - -The bulk of the input processing in the APZ code happens on what we call "the input thread". -In practice the input thread could be the Gecko main thread, the compositor thread, or some other thread. -There are obvious downsides to using the Gecko main thread - that is, "asynchronous" panning and zooming is not really asynchronous as input events can only be processed while Gecko is idle. -In an e10s environment, using the Gecko main thread of the chrome process is acceptable, because the code running in that process is more controllable and well-behaved than arbitrary web content. -Using the compositor thread as the input thread could work on some platforms, but may be inefficient on others. -For example, on Android (Fennec) we receive input events from the system on a dedicated UI thread. -We would have to redispatch the input events to the compositor thread if we wanted the input thread to be the same as the compositor thread. -This introduces a potential for higher latency, particularly if the compositor does any blocking operations - blocking SwapBuffers operations, for example. -As a result, the APZ code itself does not assume that the input thread will be the same as the Gecko main thread or the compositor thread. - -#### Active vs. inactive scrollframes - -The number of scrollframes on a page is potentially unbounded. -However, we do not want to create a separate layer for each scrollframe right away, as this would require large amounts of memory. -Therefore, scrollframes are designated as either "active" or "inactive". -Active scrollframes are the ones that do have their contents put on a separate layer (or set of layers), and inactive ones do not. - -Consider a page with a scrollframe that is initially inactive.
-When layout generates the layers for this page, the content of the scrollframe will be flattened into some other PaintedLayer (call it P). -The layout code also adds the area (or bounding region in case of weird shapes) of the scrollframe to the dispatch-to-content region of P. - -When the user starts interacting with that content, the hit-test in the APZ code finds the dispatch-to-content region of P. -The input block therefore has a tentative target of P when it goes into step 4(ii) in the flow above. -When gecko processes the input event, it must detect the inactive scrollframe and activate it, as part of step 7. -Finally, the widget code sends the SetTargetAPZC notification in step 8 to notify the APZ that the input block should really apply to this new layer. -The issue here is that the layer transaction containing the new layer must reach the compositor and APZ before the SetTargetAPZC notification. -If this does not occur within the 400ms timeout, the APZ code will be unable to update the tentative target, and will continue to use P for that input block. -Input blocks that start after the layer transaction will get correctly routed to the new layer as there will now be a layer and APZC instance for the active scrollframe. - -This model implies that when the user initially attempts to scroll an inactive scrollframe, it may end up scrolling an ancestor scrollframe. -(This is because in the absence of the SetTargetAPZC notification, the input events will get applied to the closest ancestor scrollframe's APZC.) -Only after the round-trip to the gecko thread is complete is there a layer for async scrolling to actually occur on the scrollframe itself. -At that point the scrollframe will start receiving new input blocks and will scroll normally. 
diff --git a/gfx/doc/GraphicsOverview.md b/gfx/doc/GraphicsOverview.md deleted file mode 100644 index b834172c8d7b..000000000000 --- a/gfx/doc/GraphicsOverview.md +++ /dev/null @@ -1,83 +0,0 @@ -Mozilla Graphics Overview {#graphicsoverview} -================= -## Work in progress. Possibly incorrect or incomplete. - -Overview --------- -The graphics system is responsible for rendering (painting, drawing) the frame tree (rendering tree) elements as created by the layout system. Each leaf in the tree has content, bounded by a rectangle (or perhaps another shape, in the case of SVG). - -The simple approach for producing the result would thus involve traversing the frame tree, in a correct order, drawing each frame into the resulting buffer and displaying (printing notwithstanding) that buffer when the traversal is done. It is worth spending some time on the "correct order" note above. If there are no overlapping frames, this is fairly simple - any order will do, as long as there is no background. If there is background, we just have to worry about drawing that first. Since we do not control the content, chances are the page is more complicated. There are overlapping frames, likely with transparency, so we need to make sure the elements are drawn "back to front", in layers, so to speak. Layers are an important concept, and we will revisit them shortly, as they are central to fixing a major issue with the above simple approach. - -While the above simple approach will work, the performance will suffer. Each time anything changes in any of the frames, the complete process needs to be repeated, everything needs to be redrawn. Further, there is very little space to take advantage of the modern graphics (GPU) hardware, or multi-core computers. If you recall from the previous sections, the frame tree is only accessible from the UI thread, so while we're doing all this work, the UI is basically blocked.
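To make the "back to front" requirement concrete, here is a minimal sketch of the simple approach for a single pixel. It is not Gecko code: the function names, the `(depth, color)` frame representation, and the choice of premultiplied RGBA with the standard "over" operator are all assumptions made for illustration.

```python
def over(src, dst):
    """Composite premultiplied RGBA `src` on top of `dst` ("over" operator)."""
    sr, sg, sb, sa = src
    dr, dg, db, da = dst
    keep = 1.0 - sa  # how much of the destination still shows through
    return (sr + dr * keep, sg + dg * keep, sb + db * keep, sa + da * keep)


def paint_pixel(frames, background=(0.0, 0.0, 0.0, 0.0)):
    """Paint overlapping frames back to front.

    `frames` is a list of (depth, color) pairs; a larger depth is closer
    to the viewer. The input order does not matter because we sort first.
    """
    result = background
    for _, color in sorted(frames, key=lambda frame: frame[0]):
        result = over(color, result)
    return result


if __name__ == "__main__":
    opaque_red = (1.0, 0.0, 0.0, 1.0)
    half_blue = (0.0, 0.0, 0.5, 0.5)  # 50% blue, premultiplied
    # Blue in front of red: the red shows through the blue.
    print(paint_pixel([(1, half_blue), (0, opaque_red)]))
    # Swapping the depths changes the result: opaque red hides the blue.
    print(paint_pixel([(0, half_blue), (1, opaque_red)]))
```

The two calls differ only in depth order and produce different pixels, which is why the traversal order matters as soon as frames overlap.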
- -### (Retained) Layers - -The Layers framework was introduced to address the above performance issues, by having a part of the design address each item. At the high level: - -1. We create a layer tree. The leaf elements of the tree contain all frames (possibly multiple frames per leaf). -2. We render each layer tree element and cache (retain) the result. -3. We composite (combine) all the leaf elements into the final result. - -Let's examine each of these steps, in reverse order. - -### Compositing -We use the term composite as it implies that the order is important. If the elements being composited overlap, whether there is transparency involved or not, the order in which they are combined will affect the result. -Compositing is where we can use some of the power of the modern graphics hardware. It is optimal for doing this job. In the scenarios where only the position of individual frames changes, without the content inside them changing, we see why caching each layer would be advantageous - we only need to repeat the final compositing step, completely skipping the layer tree creation and the rendering of each leaf, thus speeding up the process considerably. - -Another benefit is equally apparent in the context of the stated deficiencies of the simple approach. We can use the available graphics hardware accelerated APIs to do the compositing step. Direct3D, OpenGL can be used on different platforms and are well suited to accelerate this step. - -Finally, we can now envision performing the compositing step on a separate thread, unblocking the UI thread for other work, and doing more work in parallel. More on this below. - -It is important to note that the number of operations in this step is proportional to the number of layer tree (leaf) elements, so there is additional work and complexity involved, when the layer tree is large. - -#### Render and retain layer elements -As we saw, the compositing step benefits from caching the intermediate result.
This does result in the extra memory usage, so needs to be considered during the layer tree creation. Beyond the caching, we can accelerate the rendering of each element by (indirectly) using the available platform APIs (e.g., Direct2D, CoreGraphics, even some of the 3D APIs like OpenGL or Direct3D) as available. This is actually done through a platform independent API (see Moz2D) below, but is important to realize it does get accelerated appropriately. - -#### Creating the layer tree -We need to create a layer tree (from the frames tree), which will give us the correct result while striking the right balance between a layer per frame element and a single layer for the complete frames tree. As was mentioned above, there is an overhead in traversing the whole tree and caching each of the elements, balanced by the performance improvements. Some of the performance improvements are only noticed when something changes (e.g., one element is moving, we only need to redo the compositing step). - -### Refresh Driver - -### Layers - -#### Rendering each layer - -### Tiling vs. Buffer Rotation vs. Full paint - -#### Compositing for the final result - -### Graphics API - -#### Moz2D -* The Moz2D graphics API, part of the Azure project, is a cross-platform interface onto the various graphics backends that Gecko uses for rendering such as Direct2D (1.0 and 1.1), Skia, Cairo, Quartz, and NV Path. Adding a new graphics platform to Gecko is accomplished by adding a backend to Moz2D. -\see [Moz2D documentation on wiki](https://wiki.mozilla.org/Platform/GFX/Moz2D) - -#### Compositing - -#### Image Decoding - -#### Image Animation - -### Funny words -There are a lot of code words that we use to refer to projects, libraries, areas of the code. Here's an attempt to cover some of those: -* Azure - See Moz2D in the Graphics API section above. -* Backend - See Moz2D in the Graphics API section above. -* Cairo - http://www.cairographics.org/. 
Cairo is a 2D graphics library with support for multiple output devices. Currently supported output targets include the X Window System (via both Xlib and XCB), Quartz, Win32, image buffers, PostScript, PDF, and SVG file output. -* Moz2D - See Moz2D in the Graphics API section above. -* Thebes - Graphics API that preceded Moz2D. -* Reflow -* Display list - -### [Historical Documents](http://www.youtube.com/watch?v=lLZQz26-kms) -A number of posts and blogs that will give you more details or more background, or reasoning that led to different solutions and approaches. - -* 2010-01 [Layers: Cross Platform Acceleration] (http://www.basschouten.com/blog1.php/layers-cross-platform-acceleration) -* 2010-04 [Layers] (http://robert.ocallahan.org/2010/04/layers_01.html) -* 2010-07 [Retained Layers](http://robert.ocallahan.org/2010/07/retained-layers_16.html) -* 2011-04 [Introduction](https://blog.mozilla.org/joe/2011/04/26/introducing-the-azure-project/ Moz2D) -* 2011-07 [Layers](http://chrislord.net/index.php/2011/07/25/shadow-layers-and-learning-by-failing/ Shadow) -* 2011-09 [Graphics API Design](http://robert.ocallahan.org/2011/09/graphics-api-design.html) -* 2012-04 [Moz2D Canvas on OSX](http://muizelaar.blogspot.ca/2012/04/azure-canvas-on-os-x.html) -* 2012-05 [Mask Layers](http://featherweightmusings.blogspot.co.uk/2012/05/mask-layers_26.html) -* 2013-07 [Graphics related](http://www.basschouten.com/blog1.php) - diff --git a/gfx/doc/LayersHistory.md b/gfx/doc/LayersHistory.md deleted file mode 100644 index 2833aa3c5b2b..000000000000 --- a/gfx/doc/LayersHistory.md +++ /dev/null @@ -1,60 +0,0 @@ -This is an overview of the major events in the history of our Layers infrastructure. - -- iPhone released in July 2007 (Built on a toolkit called LayerKit) - -- Core Animation (October 2007) LayerKit was publicly renamed to OS X 10.5 - -- Webkit CSS 3d transforms (July 2009) - -- Original layers API (March 2010) Introduced the idea of a layer manager that - would composite. 
One of the first use cases for this was hardware accelerated - YUV conversion for video. - -- Retained layers (July 7 2010 - Bug 564991) -This was an important concept that introduced the idea of persisting the layer -content across paints in gecko controlled buffers instead of just by the OS. This introduced -the concept of buffer rotation to deal with scrolling instead of using the -native scrolling APIs like ScrollWindowEx - -- Layers IPC (July 2010 - Bug 570294) -This introduced shadow layers and edit lists and was originally done for e10s v1 - -- 3d transforms (September 2011 - Bug 505115) - -- OMTC (December 2011 - Bug 711168) -This was prototyped on OS X but shipped first for Fennec - -- Tiling v1 (April 2012 - Bug 739679) -Originally done for Fennec. -This was done to avoid situations where we had to do a bunch of work for -scrolling a small amount. i.e. buffer rotation. It allowed us to have a -variety of interesting features like progressive painting and lower resolution -painting. - -- C++ Async pan zoom controller (July 2012 - Bug 750974) -The existing APZ code was in Java for Fennec so this was reimplemented. - -- Streaming WebGL Buffers (February 2013 - Bug 716859) -Infrastructure to allow OMTC WebGL and avoid the need to glFinish() every -frame. - -- Compositor API (April 2013 - Bug 825928) -The planning for this started around November 2012. -Layers refactoring created a compositor API that abstracted away the differences between the -D3D vs OpenGL. The main piece of API is DrawQuad. - -- Tiling v2 (Mar 7 2014 - Bug 963073) -Tiling for B2G. This work is mainly porting tiled layers to new textures, -implementing double-buffered tiles and implementing a texture client pool, to -be used by tiled content clients. - - A large motivation for the pool was the very slow performance of allocating tiles because -of the sync messages to the compositor. 
- - The slow performance of allocating was directly addressed by bug 959089 which allowed us -to allocate gralloc buffers without sync messages to the compositor thread. - -- B2G WebGL performance (May 2014 - Bug 1006957, 1001417, 1024144) -This work improved the synchronization mechanism between the compositor -and the producer. - diff --git a/gfx/doc/MainPage.md b/gfx/doc/MainPage.md deleted file mode 100644 index 70a9fc60a95f..000000000000 --- a/gfx/doc/MainPage.md +++ /dev/null @@ -1,21 +0,0 @@ -Mozilla Graphics {#mainpage} -====================== - -## Work in progress. Possibly incorrect or incomplete. - - -Introduction -------- -This collection of linked pages contains a combination of Doxygen -extracted source code documentation and design documents for the -Mozilla graphics architecture. The design documents live in gfx/docs directory. - -This [wiki page](https://wiki.mozilla.org/Platform/GFX) contains -information about graphics and the graphics team at MoCo. - -Continue here for a [very high level introductory overview](@ref graphicsoverview) -if you don't know where to start. - -Useful pointers for creating documentation ------- -[The mechanics of creating these files](https://wiki.mozilla.org/Platform/GFX/DesignDocumentationGuidelines) diff --git a/gfx/doc/Silk.md b/gfx/doc/Silk.md deleted file mode 100644 index 59b7dceca6dd..000000000000 --- a/gfx/doc/Silk.md +++ /dev/null @@ -1,246 +0,0 @@ -Silk Architecture Overview -================= - -#Architecture -Our current architecture is to align three components to hardware vsync timers: - -1. Compositor -2. RefreshDriver / Painting -3. Input Events - -The flow of our rendering engine is as follows: - -1. Hardware Vsync event occurs on an OS specific *Hardware Vsync Thread* on a per monitor basis. -2. The *Hardware Vsync Thread* attached to the monitor notifies the **CompositorVsyncDispatchers** and **RefreshTimerVsyncDispatcher**. -3. 
For every Firefox window on the specific monitor, notify a **CompositorVsyncDispatcher**. The **CompositorVsyncDispatcher** is specific to one window. -4. The **CompositorVsyncDispatcher** notifies a **CompositorWidgetVsyncObserver** when remote compositing, or a **CompositorVsyncScheduler::Observer** when compositing in-process. -5. If remote compositing, a vsync notification is sent from the **CompositorWidgetVsyncObserver** to the **VsyncBridgeChild** on the UI process, which sends an IPDL message to the **VsyncBridgeParent** on the compositor thread of the GPU process, which then dispatches to **CompositorVsyncScheduler::Observer**. -6. The **RefreshTimerVsyncDispatcher** notifies the Chrome **RefreshTimer** that a vsync has occurred. -7. The **RefreshTimerVsyncDispatcher** sends IPC messages to all content processes to tick their respective active **RefreshTimer**. -8. The **Compositor** dispatches input events on the *Compositor Thread*, then composites. Input events are only dispatched on the *Compositor Thread* on b2g. -9. The **RefreshDriver** paints on the *Main Thread*. - -The implementation is broken into the following sections and will reference this figure. Note that **Objects** are bold fonts while *Threads* are italicized. - - - -#Hardware Vsync -Hardware vsync events from (1) occur on a specific **Display** object. -The **Display** object is responsible for enabling / disabling vsync on a per connected display basis. -For example, if two monitors are connected, two **Display** objects will be created, each listening to vsync events for their respective displays. -We require one **Display** object per monitor as each monitor may have different vsync rates. -As a fallback solution, we have one global **Display** object that can synchronize across all connected displays. -The global **Display** is useful if a window is positioned halfway between the two monitors.
-Each platform will have to implement a specific **Display** object to hook and listen to vsync events. -As of this writing, both Firefox OS and OS X create their own hardware specific *Hardware Vsync Thread* that executes after a vsync has occurred. -OS X creates one *Hardware Vsync Thread* per **CVDisplayLinkRef**. -We do not currently support multiple displays, so we use one global **CVDisplayLinkRef** that works across all active displays. -On Windows, we have to create a new platform *thread* that waits for DwmFlush(), which works across all active displays. -Once the thread wakes up from DwmFlush(), the actual vsync timestamp is retrieved from DwmGetCompositionTimingInfo(), which is the timestamp that is actually passed into the compositor and refresh driver. - -When a vsync occurs on a **Display**, the *Hardware Vsync Thread* callback fetches all **CompositorVsyncDispatchers** associated with the **Display**. -Each **CompositorVsyncDispatcher** is notified that a vsync has occurred with the vsync's timestamp. -It is the responsibility of the **CompositorVsyncDispatcher** to notify the **Compositor** that is awaiting vsync notifications. -The **Display** will then notify the associated **RefreshTimerVsyncDispatcher**, which should notify all active **RefreshDrivers** to tick. - -All **Display** objects are encapsulated in a **VsyncSource** object. -The **VsyncSource** object lives in **gfxPlatform** and is instantiated only on the parent process when **gfxPlatform** is created. -The **VsyncSource** is destroyed when **gfxPlatform** is destroyed. -There is only one **VsyncSource** object throughout the entire lifetime of Firefox. -Each platform is expected to implement their own **VsyncSource** to manage vsync events. -On Firefox OS, this is through the **HwcComposer2D**. -On OS X, this is through **CVDisplayLinkRef**. -On Windows, it should be through **DwmGetCompositionTimingInfo**.
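The fan-out described in this section can be sketched as follows. This is a toy model rather than the Gecko classes: the Python `Display` and `VsyncSource` classes only borrow the names from the text, `RecordingDispatcher` and `display_for` are hypothetical, and everything runs on one thread instead of a real *Hardware Vsync Thread*.

```python
class RecordingDispatcher:
    """Hypothetical stand-in for a vsync dispatcher; it records what it is told."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def notify_vsync(self, timestamp):
        self.log.append((self.name, timestamp))


class Display:
    """Toy model of a per-monitor Display object."""

    def __init__(self):
        self.compositor_dispatchers = []      # one per window on this monitor
        self.refresh_timer_dispatcher = None  # ticks the refresh drivers

    def notify_vsync(self, timestamp):
        # In the design above this runs on the hardware vsync thread.
        # Compositor dispatchers are notified first, then the refresh timer.
        for dispatcher in self.compositor_dispatchers:
            dispatcher.notify_vsync(timestamp)
        if self.refresh_timer_dispatcher is not None:
            self.refresh_timer_dispatcher.notify_vsync(timestamp)


class VsyncSource:
    """Toy model of the single object that owns every Display."""

    def __init__(self):
        self.displays = {}

    def display_for(self, monitor):
        # Hypothetical helper: create the Display for a monitor on first use.
        return self.displays.setdefault(monitor, Display())


if __name__ == "__main__":
    log = []
    source = VsyncSource()
    left = source.display_for("left")
    left.compositor_dispatchers.append(RecordingDispatcher("window-1", log))
    left.refresh_timer_dispatcher = RecordingDispatcher("refresh-left", log)
    left.notify_vsync(16)
    print(log)
```

Keeping one `Display` per monitor means a vsync on one monitor never wakes the windows on another, which matters when the monitors run at different refresh rates.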
- -#Compositor -When the **CompositorVsyncDispatcher** is notified of the vsync event, the **CompositorVsyncScheduler::Observer** associated with the **CompositorVsyncDispatcher** begins execution. -Since the **CompositorVsyncDispatcher** executes on the *Hardware Vsync Thread* and the **Compositor** composites on the *CompositorThread*, the **CompositorVsyncScheduler::Observer** posts a task to the *CompositorThread*. -The **CompositorBridgeParent** then composites. -The model where the **CompositorVsyncDispatcher** notifies components on the *Hardware Vsync Thread*, and the component schedules the task on the appropriate thread is used everywhere. - -The **CompositorVsyncScheduler::Observer** listens to vsync events as needed and stops listening to vsync when composites are no longer scheduled or required. -Every **CompositorBridgeParent** is associated and tied to one **CompositorVsyncScheduler::Observer**, which is associated with the **CompositorVsyncDispatcher**. -Each **CompositorBridgeParent** is associated with one widget and is created when a new platform window or **nsBaseWidget** is created. -The **CompositorBridgeParent**, **CompositorVsyncDispatcher**, **CompositorVsyncScheduler::Observer**, and **nsBaseWidget** all have the same lifetimes, which are created and destroyed together. - -##Out-of-process Compositors -When compositing out-of-process, this model changes slightly. -In this case there are effectively two observers: a UI process observer (**CompositorWidgetVsyncObserver**), and the **CompositorVsyncScheduler::Observer** in the GPU process. -There are also two dispatchers: the widget dispatcher in the UI process (**CompositorVsyncDispatcher**), and the IPDL-based dispatcher in the GPU process (**CompositorBridgeParent::NotifyVsync**). -The UI process observer and the GPU process dispatcher are linked via an IPDL protocol called PVsyncBridge. 
-**PVsyncBridge** is a top-level protocol for sending vsync notifications to the compositor thread in the GPU process. -The compositor controls vsync observation through a separate actor, **PCompositorWidget**, which (as a subactor for **CompositorBridgeChild**) links the compositor thread in the GPU process to the main thread in the UI process. - -Out-of-process compositors do not go through **CompositorVsyncDispatcher** directly. -Instead, the **CompositorWidgetDelegate** in the UI process creates one, and gives it a **CompositorWidgetVsyncObserver**. -This observer forwards notifications to a Vsync I/O thread, where **VsyncBridgeChild** then forwards the notification again to the compositor thread in the GPU process. -The notification is received by a **VsyncBridgeParent**. -The GPU process uses the layers ID in the notification to find the correct compositor to dispatch the notification to. - -###CompositorVsyncDispatcher -The **CompositorVsyncDispatcher** executes on the *Hardware Vsync Thread*. -It contains references to the **nsBaseWidget** it is associated with and has a lifetime equal to the **nsBaseWidget**. -The **CompositorVsyncDispatcher** is responsible for notifying the **CompositorBridgeParent** that a vsync event has occurred. -There can be multiple **CompositorVsyncDispatchers** per **Display**, one **CompositorVsyncDispatcher** per window. -The only responsibility of the **CompositorVsyncDispatcher** is to notify components when a vsync event has occurred, and to stop listening to vsync when no components require vsync events. -We require one **CompositorVsyncDispatcher** per window so that we can handle multiple **Displays**. -When compositing in-process, the **CompositorVsyncDispatcher** is attached to the CompositorWidget for the -window. When out-of-process, it is attached to the CompositorWidgetDelegate, which forwards -observer notifications over IPDL. In the latter case, its lifetime is tied to a CompositorSession -rather than the nsIWidget.
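The in-process pattern, where the dispatcher notifies on the vsync thread and the observer only posts a task to the compositor thread, might be modelled like this. It is a single-threaded sketch under stated assumptions: a `deque` stands in for the compositor thread's message loop, `SchedulerObserver` and `run_pending` are hypothetical names, and unobserving immediately after one composite is a simplification of "stops listening when composites are no longer scheduled".

```python
from collections import deque


class CompositorVsyncDispatcher:
    """Toy model: forwards vsync only while an observer is registered."""

    def __init__(self):
        self.observer = None

    def set_observer(self, observer):
        self.observer = observer

    def needs_vsync(self):
        # With no observer, nothing requires vsync events.
        return self.observer is not None

    def notify_vsync(self, timestamp):
        # Called on the hardware vsync thread in the design above.
        if self.observer is not None:
            self.observer.notify_vsync(timestamp)


class SchedulerObserver:
    """Hypothetical stand-in for CompositorVsyncScheduler::Observer."""

    def __init__(self, dispatcher, compositor_queue):
        self.dispatcher = dispatcher
        self.compositor_queue = compositor_queue
        self.needs_composite = False
        self.composited = []  # timestamps of completed composites

    def schedule_composite(self):
        self.needs_composite = True
        self.dispatcher.set_observer(self)  # start observing vsync

    def notify_vsync(self, timestamp):
        # Do no real work on the vsync thread; just post a task.
        self.compositor_queue.append(lambda: self.composite(timestamp))

    def composite(self, timestamp):
        if not self.needs_composite:
            return  # a stale task posted before we stopped observing
        self.composited.append(timestamp)
        self.needs_composite = False
        self.dispatcher.set_observer(None)  # stop observing vsync


def run_pending(compositor_queue):
    """Drain the queue, as the compositor thread's loop would."""
    while compositor_queue:
        compositor_queue.popleft()()


if __name__ == "__main__":
    queue = deque()
    dispatcher = CompositorVsyncDispatcher()
    observer = SchedulerObserver(dispatcher, queue)
    observer.schedule_composite()
    dispatcher.notify_vsync(16)
    run_pending(queue)
    print(observer.composited, dispatcher.needs_vsync())
```

Posting a task instead of compositing directly keeps the vsync thread's callback short, and the stale-task check covers the case where two vsyncs arrive before the compositor drains its queue.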
- -###Multiple Displays -The **VsyncSource** has an API to switch a **CompositorVsyncDispatcher** from one **Display** to another **Display**. -For example, when one window either goes into full screen mode or moves from one connected monitor to another. -When one window moves to another monitor, we expect a platform specific notification to occur. -The detection of when a window enters full screen mode or moves is not covered by Silk itself, but the framework is built to support this use case. -The expected flow is that the OS notification occurs on **nsIWidget**, which retrieves the associated **CompositorVsyncDispatcher**. -The **CompositorVsyncDispatcher** then notifies the **VsyncSource** to switch to the correct **Display** the **CompositorVsyncDispatcher** is connected to. -Because the notification works through the **nsIWidget**, the actual switching of the **CompositorVsyncDispatcher** to the correct **Display** should occur on the *Main Thread*. -The current implementation of Silk does not handle this case and needs to be built out. - -###CompositorVsyncScheduler::Observer -The **CompositorVsyncScheduler::Observer** handles the vsync notifications and interactions with the **CompositorVsyncDispatcher**. -When the **Compositor** requires a scheduled composite, it notifies the **CompositorVsyncScheduler::Observer** that it needs to listen to vsync. -The **CompositorVsyncScheduler::Observer** then observes / unobserves vsync as needed from the **CompositorVsyncDispatcher** to enable composites. - -###GeckoTouchDispatcher -The **GeckoTouchDispatcher** is a singleton that resamples touch events to smooth out jank while tracking a user's finger. -Because input and composite are linked together, the **CompositorVsyncScheduler::Observer** has a reference to the **GeckoTouchDispatcher** and vice versa. - -###Input Events -One large goal of Silk is to align touch events with vsync events. 
-On Firefox OS, touchscreens often have different touch scan rates than the display refreshes. -A Flame device has a touch refresh rate of 75 HZ, while a Nexus 4 has a touch refresh rate of 100 HZ, while the device's display refresh rate is 60HZ. -When a vsync event occurs, we resample touch events, and then dispatch the resampled touch event to APZ. -Touch events on Firefox OS occur on a *Touch Input Thread* whereas they are processed by APZ on the *APZ Controller Thread*. -We use [Google Android's touch resampling](http://www.masonchang.com/blog/2014/8/25/androids-touch-resampling-algorithm) algorithm to resample touch events. - -Currently, we have a strict ordering between Composites and touch events. -When a touch event occurs on the *Touch Input Thread*, we store the touch event in a queue. -When a vsync event occurs, the **CompositorVsyncDispatcher** notifies the **Compositor** of a vsync event, which notifies the **GeckoTouchDispatcher**. -The **GeckoTouchDispatcher** processes the touch event first on the *APZ Controller Thread*, which is the same as the *Compositor Thread* on b2g, then the **Compositor** finishes compositing. -We require this strict ordering because if a vsync notification is dispatched to both the **Compositor** and **GeckoTouchDispatcher** at the same time, a race condition occurs between processing the touch event and therefore position versus compositing. -In practice, this creates very janky scrolling. -As of this writing, we have not analyzed input events on desktop platforms. - -One slight quirk is that input events can start a composite, for example during a scroll and after the **Compositor** is no longer listening to vsync events. -In these cases, we notify the **Compositor** to observe vsync so that it dispatches touch events. -If touch events were not dispatched, and since the **Compositor** is not listening to vsync events, the touch events would never be dispatched. 
-The **GeckoTouchDispatcher** handles this case by always forcing the **Compositor** to listen to vsync events while touch events are occurring. - -###Widget, Compositor, CompositorVsyncDispatcher, GeckoTouchDispatcher Shutdown Procedure -When the [nsBaseWidget shuts down](https://hg.mozilla.org/mozilla-central/file/0df249a0e4d3/widget/nsBaseWidget.cpp#l182) - It calls nsBaseWidget::DestroyCompositor on the *Gecko Main Thread*. -During nsBaseWidget::DestroyCompositor, it first destroys the CompositorBridgeChild. -CompositorBridgeChild sends a sync IPC call to CompositorBridgeParent::RecvStop, which calls [CompositorBridgeParent::Destroy](https://hg.mozilla.org/mozilla-central/file/ab0490972e1e/gfx/layers/ipc/CompositorBridgeParent.cpp#l509). -During this time, the *main thread* is blocked on the parent process. -CompositorBridgeParent::RecvStop runs on the *Compositor thread* and cleans up some resources, including setting the **CompositorVsyncScheduler::Observer** to nullptr. -CompositorBridgeParent::RecvStop also explicitly keeps the CompositorBridgeParent alive and posts another task to run CompositorBridgeParent::DeferredDestroy on the Compositor loop so that all ipdl code can finish executing. -The **CompositorVsyncScheduler::Observer** also unobserves from vsync and cancels any pending composite tasks. -Once CompositorBridgeParent::RecvStop finishes, the *main thread* in the parent process continues shutting down the nsBaseWidget. - -At the same time, the *Compositor thread* is executing tasks until CompositorBridgeParent::DeferredDestroy runs, which flushes the compositor message loop. -Now we have two tasks as both the nsBaseWidget releases a reference to the Compositor on the *main thread* during destruction and the CompositorBridgeParent::DeferredDestroy releases a reference to the CompositorBridgeParent on the *Compositor Thread*. 
-Finally, the CompositorBridgeParent itself is destroyed on the *main thread* once both references are gone due to explicit [main thread destruction](https://hg.mozilla.org/mozilla-central/file/50b95032152c/gfx/layers/ipc/CompositorBridgeParent.h#l148). - -With the **CompositorVsyncScheduler::Observer**, any accesses to the widget after nsBaseWidget::DestroyCompositor executes are invalid. -Any accesses to the compositor between the time the nsBaseWidget::DestroyCompositor runs and the CompositorVsyncScheduler::Observer's destructor runs aren't safe yet a hardware vsync event could occur between these times. -Since any tasks posted on the Compositor loop after CompositorBridgeParent::DeferredDestroy is posted are invalid, we make sure that no vsync tasks can be posted once CompositorBridgeParent::RecvStop executes and DeferredDestroy is posted on the Compositor thread. -When the sync call to CompositorBridgeParent::RecvStop executes, we explicitly set the CompositorVsyncScheduler::Observer to null to prevent vsync notifications from occurring. -If vsync notifications were allowed to occur, since the **CompositorVsyncScheduler::Observer**'s vsync notification executes on the *hardware vsync thread*, it would post a task to the Compositor loop and may execute after CompositorBridgeParent::DeferredDestroy. -Thus, we explicitly shut down vsync events in the **CompositorVsyncDispatcher** and **CompositorVsyncScheduler::Observer** during nsBaseWidget::Shutdown to prevent any vsync tasks from executing after CompositorBridgeParent::DeferredDestroy. - -The **CompositorVsyncDispatcher** may be destroyed on either the *main thread* or *Compositor Thread*, since both the nsBaseWidget and **CompositorVsyncScheduler::Observer** race to destroy on different threads. -nsBaseWidget is destroyed on the *main thread* and releases a reference to the **CompositorVsyncDispatcher** during destruction. 
-The **CompositorVsyncScheduler::Observer** has a race to be destroyed either during CompositorBridgeParent shutdown or from the **GeckoTouchDispatcher** which is destroyed on the main thread with [ClearOnShutdown](https://hg.mozilla.org/mozilla-central/file/21567e9a6e40/xpcom/base/ClearOnShutdown.h#l15). -Whichever object, the CompositorBridgeParent or the **GeckoTouchDispatcher** is destroyed last will hold the last reference to the **CompositorVsyncDispatcher**, which destroys the object. - -#Refresh Driver -The Refresh Driver is ticked from a [single active timer](https://hg.mozilla.org/mozilla-central/file/ab0490972e1e/layout/base/nsRefreshDriver.cpp#l11). -The assumption is that there are multiple **RefreshDrivers** connected to a single **RefreshTimer**. -There are two **RefreshTimers**: an active and an inactive **RefreshTimer**. -Each Tab has its own **RefreshDriver**, which connects to one of the global **RefreshTimers**. -The **RefreshTimers** execute on the *Main Thread* and tick their connected **RefreshDrivers**. -We do not want to break this model of multiple **RefreshDrivers** per a set of two global **RefreshTimers**. -Each **RefreshDriver** switches between the active and inactive **RefreshTimer**. - -Instead, we create a new **RefreshTimer**, the **VsyncRefreshTimer** which ticks based on vsync messages. -We replace the current active timer with a **VsyncRefreshTimer**. -All tabs will then tick based on this new active timer. -Since the **RefreshTimer** has a lifetime of the process, we only need to create a single **RefreshTimerVsyncDispatcher** per **Display** when Firefox starts. -Even if we do not have any content processes, the Chrome process will still need a **VsyncRefreshTimer**, thus we can associate the **RefreshTimerVsyncDispatcher** with each **Display**. - -When Firefox starts, we initially create a new **VsyncRefreshTimer** in the Chrome process. 
-The **VsyncRefreshTimer** will listen to vsync notifications from **RefreshTimerVsyncDispatcher** on the global **Display**. -When nsRefreshDriver::Shutdown executes, it will delete the **VsyncRefreshTimer**. -This creates a problem as all the **RefreshTimers** are currently manually memory managed whereas **VsyncObservers** are ref counted. -To work around this problem, we create a new **RefreshDriverVsyncObserver** as an inner class to **VsyncRefreshTimer**, which actually receives vsync notifications. It then ticks the **RefreshDrivers** inside **VsyncRefreshTimer**. - -With Content processes, the start up process is more complicated. -We send vsync IPC messages via the use of the PBackground thread on the parent process, which allows us to send messages from the Parent process' without waiting on the *main thread*. -This sends messages from the Parent::*PBackground Thread* to the Child::*Main Thread*. -The *main thread* receiving IPC messages on the content process is acceptable because **RefreshDrivers** must execute on the *main thread*. -However, there is some amount of time required to setup the IPC connection upon process creation and during this time, the **RefreshDrivers** must tick to set up the process. -To get around this, we initially use software **RefreshTimers** that already exist during content process startup and swap in the **VsyncRefreshTimer** once the IPC connection is created. - -During nsRefreshDriver::ChooseTimer, we create an async PBackground IPC open request to create a **VsyncParent** and **VsyncChild**. -At the same time, we create a software **RefreshTimer** and tick the **RefreshDrivers** as normal. -Once the PBackground callback is executed and an IPC connection exists, we swap all **RefreshDrivers** currently associated with the active **RefreshTimer** and swap the **RefreshDrivers** to use the **VsyncRefreshTimer**. -Since all interactions on the content process occur on the main thread, there are no need for locks. 
-The **VsyncParent** listens to vsync events through the **VsyncRefreshTimerDispatcher** on the parent side and sends vsync IPC messages to the **VsyncChild**. -The **VsyncChild** notifies the **VsyncRefreshTimer** on the content process. - -During the shutdown process of the content process, ActorDestroy is called on the **VsyncChild** and **VsyncParent** due to the normal PBackground shutdown process. -Once ActorDestroy is called, no IPC messages should be sent across the channel. -After ActorDestroy is called, the IPDL machinery will delete the **VsyncParent/Child** pair. -The **VsyncParent**, due to being a **VsyncObserver**, is ref counted. -After **VsyncParent::ActorDestroy** is called, it unregisters itself from the **RefreshTimerVsyncDispatcher**, which holds the last reference to the **VsyncParent**, and the object will be deleted. - -Thus the overall flow during normal execution is: - -1. VsyncSource::Display::RefreshTimerVsyncDispatcher receives a Vsync notification from the OS in the parent process. -2. RefreshTimerVsyncDispatcher notifies VsyncRefreshTimer::RefreshDriverVsyncObserver that a vsync occured on the parent process on the hardware vsync thread. -3. RefreshTimerVsyncDispatcher notifies the VsyncParent on the hardware vsync thread that a vsync occured. -4. The VsyncRefreshTimer::RefreshDriverVsyncObserver in the parent process posts a task to the main thread that ticks the refresh drivers. -5. VsyncParent posts a task to the PBackground thread to send a vsync IPC message to VsyncChild. -6. VsyncChild receive a vsync notification on the content process on the main thread and ticks their respective RefreshDrivers. - -###Compressing Vsync Messages -Vsync messages occur quite often and the *main thread* can be busy for long periods of time due to JavaScript. -Consistently sending vsync messages to the refresh driver timer can flood the *main thread* with refresh driver ticks, causing even more delays. 
-To avoid this problem, we compress vsync messages on both the parent and child processes. - -On the parent process, newer vsync messages update a vsync timestamp but do not actually queue any tasks on the *main thread*. -Once the parent process' *main thread* executes the refresh driver tick, it uses the most updated vsync timestamp to tick the refresh driver. -After the refresh driver has ticked, one single vsync message is queued for another refresh driver tick task. -On the content process, the IPDL **compress** keyword automatically compresses IPC messages. - -### Multiple Monitors -In order to have multiple monitor support for the **RefreshDrivers**, we have multiple active **RefreshTimers**. -Each **RefreshTimer** is associated with a specific **Display** via an id and tick when it's respective **Display** vsync occurs. -We have **N RefreshTimers**, where N is the number of connected displays. -Each **RefreshTimer** still has multiple **RefreshDrivers**. - -When a tab or window changes monitors, the **nsIWidget** receives a display changed notification. -Based on which display the window is on, the window switches to the correct **RefreshTimerVsyncDispatcher** and **CompositorVsyncDispatcher** on the parent process based on the display id. -Each **TabParent** should also send a notification to their child. -Each **TabChild**, given the display ID, switches to the correct **RefreshTimer** associated with the display ID. -When each display vsync occurs, it sends one IPC message to notify vsync. -The vsync message contains a display ID, to tick the appropriate **RefreshTimer** on the content process. -There is still only one **VsyncParent/VsyncChild** pair, just each vsync notification will include a display ID, which maps to the correct **RefreshTimer**. - -#Object Lifetime -1. CompositorVsyncDispatcher - Lives as long as the nsBaseWidget associated with the VsyncDispatcher -2. 
CompositorVsyncScheduler::Observer - Lives and dies the same time as the CompositorBridgeParent. -3. RefreshTimerVsyncDispatcher - As long as the associated display object, which is the lifetime of Firefox. -4. VsyncSource - Lives as long as the gfxPlatform on the chrome process, which is the lifetime of Firefox. -5. VsyncParent/VsyncChild - Lives as long as the content process -6. RefreshTimer - Lives as long as the process - -#Threads -All **VsyncObservers** are notified on the *Hardware Vsync Thread*. It is the responsibility of the **VsyncObservers** to post tasks to their respective correct thread. For example, the **CompositorVsyncScheduler::Observer** will be notified on the *Hardware Vsync Thread*, and post a task to the *Compositor Thread* to do the actual composition. - -1. Compositor Thread - Nothing changes -2. Main Thread - PVsyncChild receives IPC messages on the main thread. We also enable/disable vsync on the main thread. -3. PBackground Thread - Creates a connection from the PBackground thread on the parent process to the main thread in the content process. -4. Hardware Vsync Thread - Every platform is different, but we always have the concept of a hardware vsync thread. Sometimes this is actually created by the host OS. On Windows, we have to create a separate platform thread that blocks on DwmFlush(). diff --git a/gfx/docs/AdvancedLayers.rst b/gfx/docs/AdvancedLayers.rst new file mode 100644 index 000000000000..708391f5119d --- /dev/null +++ b/gfx/docs/AdvancedLayers.rst @@ -0,0 +1,370 @@ +Advanced Layers +=============== + +Advanced Layers is a new method of compositing layers in Gecko. This +document serves as a technical overview and provides a short +walk-through of its source code. + +Overview +-------- + +Advanced Layers attempts to group as many GPU operations as it can into +a single draw call. This is a common technique in GPU-based rendering +called “batching”. 
It is not always trivial, as a batching algorithm can +easily waste precious CPU resources trying to build optimal draw calls. + +Advanced Layers reuses the existing Gecko layers system as much as +possible. Huge layer trees do not currently scale well (see the future +work section), so opportunities for batching are currently limited +without expending unnecessary resources elsewhere. However, Advanced +Layers has a few benefits: + +- It submits smaller GPU workloads and buffer uploads than the existing + compositor. +- It needs only a single pass over the layer tree. +- It uses occlusion information more intelligently. +- It is easier to add new specialized rendering paths and new layer + types. +- It separates compositing logic from device logic, unlike the existing + compositor. +- It is much faster at rendering 3d scenes or complex layer trees. +- It has experimental code to use the z-buffer for occlusion culling. + +Because of these benefits we hope that it provides a significant +improvement over the existing compositor. + +Advanced Layers uses the acronym “MLG” and “MLGPU” in many places. This +stands for “Mid-Level Graphics”, the idea being that it is optimized for +Direct3D 11-style rendering systems as opposed to Direct3D 12 or Vulkan. + +LayerManagerMLGPU +----------------- + +Advanced layers does not change client-side rendering at all. Content +still uses Direct2D (when possible), and creates identical layer trees +as it would with a normal Direct3D 11 compositor. In fact, Advanced +Layers re-uses all of the existing texture handling and video +infrastructure as well, replacing only the composite-side layer types. + +Advanced Layers does not create a ``LayerManagerComposite`` - instead, +it creates a ``LayerManagerMLGPU``. This layer manager does not have a +``Compositor`` - instead, it has an ``MLGDevice``, which roughly +abstracts the Direct3D 11 API. 
(The hope is that this API is easily +interchangeable for something else when cross-platform or software +support is needed.) + +``LayerManagerMLGPU`` also dispenses with the old “composite” layers for +new layer types. For example, ``ColorLayerComposite`` becomes +``ColorLayerMLGPU``. Since these layer types implement ``HostLayer``, +they integrate with ``LayerTransactionParent`` as normal composite +layers would. + +Rendering Overview +------------------ + +The steps for rendering are described in more detail below, but roughly +the process is: + +1. Sort layers front-to-back. +2. Create a dependency tree of render targets (called “views”). +3. Accumulate draw calls for all layers in each view. +4. Upload draw call buffers to the GPU. +5. Execute draw commands for each view. + +Advanced Layers divides the layer tree into “views” +(``RenderViewMLGPU``), each of which corresponds to a render target. The root +layer is represented by a view corresponding to the screen. Layers that +require intermediate surfaces have temporary views. Layers are analyzed +front-to-back, and rendered back-to-front within a view. Views +themselves are rendered front-to-back, to minimize render target +switching. + +Each view contains one or more rendering passes (``RenderPassMLGPU``). A +pass represents a single draw command with one or more rendering items +attached to it. For example, a ``SolidColorPass`` item contains a +rectangle and an RGBA value, and many of these can be drawn with a +single GPU call. + +When considering a layer, views will first try to find an existing +rendering batch that can support it. If so, that pass will accumulate +another draw item for the layer. Otherwise, a new pass will be added. + +When trying to find a matching pass for a layer, there is a tradeoff in +CPU time versus the GPU time saved by not issuing another draw command. +We generally care more about CPU time, so we do not try too hard in +matching items to an existing batch. 
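
The pass-matching tradeoff described above can be illustrated with a minimal sketch. Everything here is hypothetical: ``Item``, ``Pass``, ``View`` and ``AddItem`` are simplified stand-ins, not the real ``RenderPassMLGPU`` or ``RenderViewMLGPU`` interfaces, and checking only the most recent pass is just one cheap policy that keeps CPU cost low and preserves draw order. The actual lookup in Gecko may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for illustration only.
enum class PassKind { SolidColor, Textured };

struct Item {
  PassKind kind;
  uint32_t textureId;  // Ignored for SolidColor items.
};

struct Pass {
  PassKind kind;
  uint32_t textureId;
  std::vector<Item> items;

  // A pass can batch an item of the same kind; textured items must also
  // share the bound texture.
  bool CanAccept(const Item& aItem) const {
    if (kind != aItem.kind) {
      return false;
    }
    return kind == PassKind::SolidColor || textureId == aItem.textureId;
  }
};

struct View {
  std::vector<Pass> passes;

  // Assumed policy: look only at the most recent pass. This is O(1) per
  // item and never reorders draws, at the cost of missing some batches.
  void AddItem(const Item& aItem) {
    if (!passes.empty() && passes.back().CanAccept(aItem)) {
      passes.back().items.push_back(aItem);
      return;
    }
    passes.push_back(Pass{aItem.kind, aItem.textureId, {aItem}});
  }
};
```

With this policy, two adjacent solid-color items share one pass, while a solid-color item that follows a textured one starts a new pass even though an earlier compatible pass exists.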
+ +After all layers have been processed, there is a “prepare” step. This +copies all accumulated draw data and uploads it into vertex and constant +buffers in the GPU. + +Finally, we execute rendering commands. At the end of the frame, all +batches and (most) constant buffers are thrown away. + +Shaders Overview +---------------- + +Advanced Layers currently has five layer-related shader pipelines: + +- Textured (PaintedLayer, ImageLayer, CanvasLayer) +- ComponentAlpha (PaintedLayer with component-alpha) +- YCbCr (ImageLayer with YCbCr video) +- Color (ColorLayers) +- Blend (ContainerLayers with mix-blend modes) + +There are also three special shader pipelines: + +- MaskCombiner, which is used to combine mask layers into a single + texture. +- Clear, which is used for fast region-based clears when not directly + supported by the GPU. +- Diagnostic, which is used to display the diagnostic overlay texture. + +The layer shaders follow a unified structure. Each pipeline has a vertex +and pixel shader. The vertex shader takes a layers ID, a z-buffer depth, +a unit position in either a unit square or unit triangle, and either +rectangular or triangular geometry. Shaders can also have ancillary data +needed like texture coordinates or colors. + +Most of the time, layers have simple rectangular clips with simple +rectilinear transforms, and pixel shaders do not need to perform masking +or clipping. For these layers we use a fast-path pipeline, using +unit-quad shaders that are able to clip geometry so the pixel shader +does not have to. This type of pipeline does not support complex masks. + +If a layer has a complex mask, a rotation or 3d transform, or a complex +operation like blending, then we use shaders capable of handling +arbitrary geometry. Their input is a unit triangle, and these shaders +are generally more expensive. + +All of the shader-specific data is modelled in ShaderDefinitionsMLGPU.h. 
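
As a rough illustration of the fast-path versus arbitrary-geometry choice described in this section, the decision could be sketched as follows. ``LayerTraits``, ``ShaderPath`` and ``ChooseShaderPath`` are hypothetical names invented for this sketch; they do not appear in ``ShaderDefinitionsMLGPU.h``, and the real selection logic may consider more inputs.

```cpp
#include <cassert>

// Hypothetical per-layer summary; field names are illustrative only.
struct LayerTraits {
  bool hasComplexMask = false;
  bool hasRectilinearTransform = true;  // No rotation or 3d transform.
  bool hasRectangularClip = true;
  bool needsBlend = false;
};

enum class ShaderPath {
  UnitQuad,      // Fast path: geometry is clipped in the vertex shader.
  TriangleList,  // Arbitrary geometry: more flexible, more expensive.
};

// Assumed rule of thumb taken from the prose: anything that is not a
// simple rectangle with a rectilinear transform takes the triangle path.
ShaderPath ChooseShaderPath(const LayerTraits& aLayer) {
  if (aLayer.hasComplexMask || aLayer.needsBlend ||
      !aLayer.hasRectilinearTransform || !aLayer.hasRectangularClip) {
    return ShaderPath::TriangleList;
  }
  return ShaderPath::UnitQuad;
}
```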
+ +CPU Occlusion Culling +--------------------- + +By default, Advanced Layers performs occlusion culling on the CPU. Since +layers are visited front-to-back, this is simply a matter of +accumulating the visible region of opaque layers, and subtracting it +from the visible region of subsequent layers. There is a major +difference between this occlusion culling and PostProcessLayers of the +old compositor: AL performs culling after invalidation, not before. +Completely valid layers will have an empty visible region. + +Most layer types (with the exception of images) will intelligently split +their draw calls into a batch of individual rectangles, based on their +visible region. + +Z-Buffering and Occlusion +------------------------- + +Advanced Layers also supports occlusion culling on the GPU, using a +z-buffer. This is disabled by default currently since it is +significantly costly on integrated GPUs. When using the z-buffer, we +separate opaque layers into a separate list of passes. The render +process then uses the following steps: + +1. The depth buffer is set to read-write. +2. Opaque batches are executed. +3. The depth buffer is set to read-only. +4. Transparent batches are executed. + +The problem we have observed is that the depth buffer increases writes +to the GPU, and on integrated GPUs this is expensive - we have seen draw +call times increase by 20-30%, which is the wrong direction for +battery life. In particular on a full screen video, the call to +ClearDepthStencilView plus the actual depth buffer write of the video +can double GPU time. + +For now the depth-buffer is disabled until we can find a compelling case +for it on non-integrated hardware. + +Clipping +-------- + +Clipping is a bit tricky in Advanced Layers. We cannot use the hardware +“scissor” feature, since the clip can change from instance to instance +within a batch. And if using the depth buffer, we cannot write +transparent pixels for the clipped area. 
As a result we always clip +opaque draw rects in the vertex shader (and sometimes even on the CPU, +as is needed for sane texture coordinates). Only transparent items are +clipped in the pixel shader. As a result, masked layers and layers with +non-rectangular transforms are always considered transparent, and use a +more flexible clipping pipeline. + +Plane Splitting +--------------- + +Plane splitting is when a 3D transform causes a layer to be split - for +example, one transparent layer may intersect another on a separate +plane. When this happens, Gecko sorts layers using a BSP tree and +produces a list of triangles instead of draw rects. + +These layers cannot use the “unit quad” shaders that support the fast +clipping pipeline. Instead they always use the full triangle-list +shaders that support extended vertices and clipping. + +This is the slowest path we can take when building a draw call, since we +must interact with the polygon clipping and texturing code. + +Masks +----- + +For each layer with a mask attached, Advanced Layers builds a +``MaskOperation``. These operations must resolve to a single mask +texture, as well as a rectangular area to which the mask applies. All +batched pixel shaders will automatically clip pixels to the mask if a +mask texture is bound. (Note that we must use separate batches if the +mask texture changes.) + +Some layers have multiple mask textures. In this case, the MaskOperation +will store the list of masks, and right before rendering, it will invoke +a shader to combine these masks into a single texture. + +MaskOperations are shared across layers when possible, but are not +cached across frames. + +BigImage Support +---------------- + +ImageLayers and CanvasLayers can be tiled with many individual textures. +This happens in rare cases where the underlying buffer is too big for +the GPU. Early on this caused problems for Advanced Layers, since AL +required one texture per layer. 
We implemented BigImage support by +creating temporary ImageLayers for each visible tile, and throwing those +layers away at the end of the frame. + +Advanced Layers no longer has a 1:1 layer:texture restriction, but we +retain the temporary layer solution anyway. It is not much code and it +means we do not have to split ``TexturedLayerMLGPU`` methods into +iterated and non-iterated versions. + +Texture Locking +--------------- + +Advanced Layers has a different texture locking scheme than the existing +compositor. If a texture needs to be locked, then it is locked by the +MLGDevice automatically when bound to the current pipeline. The +MLGDevice keeps a set of the locked textures to avoid double-locking. At +the end of the frame, any textures in the locked set are unlocked. + +We cannot easily replicate the locking scheme in the old compositor, +since the duration of using the texture is not scoped to when we visit +the layer. + +Buffer Measurements +------------------- + +Advanced Layers uses constant buffers to send layer information and +extended instance data to the GPU. We do this by pre-allocating large +constant buffers and mapping them with ``MAP_DISCARD`` at the beginning +of the frame. Batches may allocate into this up to the maximum bindable +constant buffer size of the device (currently, 64KB). + +There are some downsides to this approach. Constant buffers are +difficult to work with - they have specific alignment requirements, and +care must be taken not to run over the maximum number of constants in a +buffer. Another approach would be to store constants in a 2D texture and +use vertex shader texture fetches. Advanced Layers implemented this and +benchmarked it to decide which approach to use. Textures tended to perform +better on the GPU but worse on the CPU, though this varied depending +on the GPU. Overall constant buffers performed best and most +consistently, so we have kept them. 
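
To make the alignment and size constraints concrete, here is a minimal sketch of a per-frame bump allocator over one large constant buffer. It is not the ``SharedBufferMLGPU`` implementation: the class and member names are hypothetical, the 64KB limit is taken from the text above, and the 256-byte alignment is an assumption based on Direct3D 11.1 requiring constant buffer offsets in multiples of 16 constants (16 bytes each).

```cpp
#include <cassert>
#include <cstddef>

// Assumed limits for this sketch.
static const size_t kMaxBindableBytes = 64 * 1024;  // From the text above.
static const size_t kAlignment = 256;               // Assumption (D3D11.1).

struct Allocation {
  size_t offset;  // Byte offset into the shared buffer.
  size_t size;    // Aligned size actually reserved.
};

// Hypothetical allocator: hands out aligned sections of a buffer that is
// conceptually mapped with MAP_DISCARD once per frame.
class SharedConstantBuffer {
 public:
  explicit SharedConstantBuffer(size_t aCapacity) : mCapacity(aCapacity) {}

  // Returns false if the request exceeds the bindable limit or the space
  // remaining in this buffer; the caller would then need another buffer.
  bool Allocate(size_t aBytes, Allocation* aOut) {
    size_t size = (aBytes + kAlignment - 1) / kAlignment * kAlignment;
    if (size == 0 || size > kMaxBindableBytes) {
      return false;
    }
    if (size > mCapacity - mCursor) {
      return false;
    }
    aOut->offset = mCursor;
    aOut->size = size;
    mCursor += size;
    return true;
  }

  // Called at the start of each frame, mirroring the discard-and-remap.
  void Reset() { mCursor = 0; }

 private:
  size_t mCapacity;
  size_t mCursor = 0;
};
```

Because every reserved size is rounded up to the alignment, each returned offset is itself aligned, which is what offset-based binding needs.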
+ +Additionally, we tested different ways of performing buffer uploads. +Buffer creation itself is costly, especially on integrated GPUs, and +especially so for immutable, immediate-upload buffers. As a result we +aggressively cache buffer objects and always allocate them as +MAP_DISCARD unless they are write-once and long-lived. + +Buffer Types +------------ + +Advanced Layers has a few different classes to help build and upload +buffers to the GPU. They are: + +- ``MLGBuffer``. This is the low-level shader resource that + ``MLGDevice`` exposes. It is the building block for buffer helper + classes, but it can also be used to make one-off, immutable, + immediate-upload buffers. MLGBuffers, being a GPU resource, are + reference counted. +- ``SharedBufferMLGPU``. These are large, pre-allocated buffers that + are read-only on the GPU and write-only on the CPU. They usually + exceed the maximum bindable buffer size. There are three shared + buffers created by default and they are automatically unmapped as + needed: one for vertices, one for vertex shader constants, and one + for pixel shader constants. When callers allocate into a shared + buffer they get back a mapped pointer, a GPU resource, and an offset. + When the underlying device supports offsetable buffers (like + ``ID3D11DeviceContext1`` does), this results in better GPU + utilization, as there are fewer resources and fewer upload commands. +- ``ConstantBufferSection`` and ``VertexBufferSection``. These are + “views” into a ``SharedBufferMLGPU``. They contain the underlying + ``MLGBuffer``, and when offsetting is supported, the offset + information necessary for resource binding. Sections are not + reference counted. +- ``StagingBuffer``. A dynamically sized CPU buffer where items can be + appended in a free-form manner. The stride of a single “item” is + computed by the first item written, and successive items must have + the same stride. The buffer must be uploaded to the GPU manually. 
+ Staging buffers are appropriate for creating general constant or + vertex buffer data. They can also write items in reverse, which is + how we render back-to-front when layers are visited front-to-back. + They can be uploaded to a ``SharedBufferMLGPU`` or an immutable + ``MLGBuffer`` very easily. Staging buffers are not reference counted. + +Unsupported Features +-------------------- + +Currently, these features of the old compositor are not yet implemented. + +- OpenGL and software support (currently AL only works on D3D11). +- APZ displayport overlay. +- Diagnostic/developer overlays other than the FPS/timing overlay. +- DEAA. It was never ported to the D3D11 compositor, but we would like + it. +- Component alpha when used inside an opaque intermediate surface. +- Effects prefs. Possibly not needed post-B2G removal. +- Widget overlays and underlays used by macOS and Android. +- DefaultClearColor. This is Android specific, but is easy to add + when needed. +- Frame uniformity info in the profiler. Possibly not needed post-B2G + removal. +- LayerScope. There are no plans to make this work. + +Future Work +----------- + +- Refactor for D3D12/Vulkan support (namely, split MLGDevice into + something less stateful and something else more low-level). +- Remove “MLG” moniker and namespace everything. +- Other backends (D3D12/Vulkan, OpenGL, Software) +- Delete CompositorD3D11 +- Add DEAA support +- Re-enable the depth buffer by default for fast GPUs +- Re-enable right-sizing of inaccurately sized containers +- Drop constant buffers for ancillary vertex data +- Fast shader paths for simple video/painted layer cases + +History +------- + +Advanced Layers has gone through four major design iterations. The +initial version used tiling - each render view divided the screen into +128x128 tiles, and layers were assigned to tiles based on their +screen-space draw area. This approach proved not to scale well to 3d +transforms, and so tiling was eliminated. 
+ +We replaced it with a simple system of accumulating draw regions to each +batch, thus ensuring that items could be assigned to batches while +maintaining correct z-ordering. This second iteration also coincided +with plane-splitting support. + +On large layer trees, accumulating the affected regions of batches +proved to be quite expensive. This led to a third iteration, using depth +buffers and separate opaque and transparent batch lists to achieve +z-ordering and occlusion culling. + +Finally, depth buffers proved to be too expensive, and we introduced a +simple CPU-based occlusion culling pass. This iteration coincided with +using more precise draw rects and splitting pipelines into unit-quad, +cpu-clipped and triangle-list, gpu-clipped variants. diff --git a/gfx/docs/AsyncPanZoom.rst b/gfx/docs/AsyncPanZoom.rst new file mode 100644 index 000000000000..bfe4a50c1032 --- /dev/null +++ b/gfx/docs/AsyncPanZoom.rst @@ -0,0 +1,452 @@ +.. _apz: + +Asynchronous Panning and Zooming +================================ + +**This document is a work in progress. Some information may be missing +or incomplete.** + +.. image:: AsyncPanZoomArchitecture.png + +Goals +----- + +We need to be able to provide a visual response to user input with +minimal latency. In particular, on devices with touch input, content +must track the finger exactly while panning, or the user experience is +very poor. According to the UX team, 120ms is an acceptable latency +between user input and response. + +Context and surrounding architecture +------------------------------------ + +The fundamental problem we are trying to solve with the Asynchronous +Panning and Zooming (APZ) code is that of responsiveness. 
By default, +web browsers operate in a “game loop” that looks like this: + +:: + + while true: + process input + do computations + repaint content + display repainted content + +In browsers the “do computations” step can be arbitrarily expensive +because it can involve running event handlers in web content. Therefore, +there can be an arbitrary delay between the input being received and the +on-screen display getting updated. + +Responsiveness is always good, and with touch-based interaction it is +even more important than with mouse or keyboard input. In order to +ensure responsiveness, we split the “game loop” model of the browser +into a multithreaded variant which looks something like this: + +:: + + Thread 1 (compositor thread) + while true: + receive input + send a copy of input to thread 2 + adjust painted content based on input + display adjusted painted content + + Thread 2 (main thread) + while true: + receive input from thread 1 + do computations + repaint content + update the copy of painted content in thread 1 + +This multithreaded model is called off-main-thread compositing (OMTC), +because the compositing (where the content is displayed on-screen) +happens on a separate thread from the main thread. Note that this is a +very simplified model, but in this model the “adjust painted +content based on input” step is the primary function of the APZ code. + +The “painted content” is stored on a set of “layers” that are +conceptually double-buffered. That is, when the main thread does its +repaint, it paints into one set of layers (the “client” layers). The +update that is sent to the compositor thread copies all the changes from +the client layers into another set of layers that the compositor holds. +These layers are called the “shadow” layers or the “compositor” layers. +The compositor in theory can continuously composite these shadow layers +to the screen while the main thread is busy doing other things and +painting a new set of client layers. 
+ +The APZ code takes the input events that are coming in from the hardware +and uses them to figure out what the user is trying to do (e.g. pan the +page, zoom in). It then expresses this user intention in the form of +translation and/or scale transformation matrices. These transformation +matrices are applied to the shadow layers at composite time, so that +what the user sees on-screen reflects what they are trying to do as +closely as possible. + +Technical overview +------------------ + +As per the heavily simplified model described above, the fundamental +purpose of the APZ code is to take input events and produce +transformation matrices. This section attempts to break that down and +identify the different problems that make this task non-trivial. + +Checkerboarding +~~~~~~~~~~~~~~~ + +The content area that is painted and stored in a shadow layer is called +the “displayport”. The APZ code is responsible for determining how large +the displayport should be. On the one hand, we want the displayport to +be as large as possible. At the very least it needs to be larger than +what is visible on-screen, because otherwise, as soon as the user pans, +there will be some unpainted area of the page exposed. However, we +cannot always set the displayport to be the entire page, because the +page can be arbitrarily long and this would require an unbounded amount +of memory to store. Therefore, a good displayport size is one that is +larger than the visible area but not so large that it is a huge drain on +memory. Because the displayport is usually smaller than the whole page, +it is always possible for the user to scroll so fast that they end up in +an area of the page outside the displayport. When this happens, they see +unpainted content; this is referred to as “checkerboarding”, and we try +to avoid it where possible. + +There are many possible ways to determine what the displayport should be +in order to balance the tradeoffs involved (i.e. 
having one that is too +big is bad for memory usage, and having one that is too small results in +excessive checkerboarding). Ideally, the displayport should cover +exactly the area that we know the user will make visible. Although we +cannot know this for sure, we can use heuristics based on current +panning velocity and direction to ensure a reasonably-chosen displayport +area. This calculation is done in the APZ code, and a new desired +displayport is frequently sent to the main thread as the user is panning +around. + +Multiple layers +~~~~~~~~~~~~~~~ + +Consider, for example, a scrollable page that contains an iframe which +itself is scrollable. The iframe can be scrolled independently of the +top-level page, and we would like both the page and the iframe to scroll +responsively. This means that we want independent asynchronous panning +for both the top-level page and the iframe. In addition to iframes, +elements that have the overflow:scroll CSS property set are also +scrollable, and also end up on separate scrollable layers. In the +general case, the layers are arranged in a tree structure, and so within +the APZ code we have a matching tree of AsyncPanZoomController (APZC) +objects, one for each scrollable layer. To manage this tree of APZC +instances, we have a single APZCTreeManager object. Each APZC is +relatively independent and handles the scrolling for its associated +layer, but there are some cases in which they need to interact; these +cases are described in the sections below. + +Hit detection +~~~~~~~~~~~~~ + +Consider again the case where we have a scrollable page that contains an +iframe which itself is scrollable. As described above, we will have two +APZC instances - one for the page and one for the iframe. When the user +puts their finger down on the screen and moves it, we need to do some +sort of hit detection in order to determine whether their finger is on +the iframe or on the top-level page. 
Based on where their finger lands, +the appropriate APZC instance needs to handle the input. This hit +detection is also done in the APZCTreeManager, as it has the necessary +information about the sizes and positions of the layers. Currently this +hit detection is not perfect, as it uses rects and does not account for +things like rounded corners and opacity. + +Also note that for some types of input (e.g. when the user puts two +fingers down to do a pinch) we do not want the input to be “split” +across two different APZC instances. In the case of a pinch, for +example, we find a “common ancestor” APZC instance - one that is +zoomable and contains all of the touch input points, and direct the +input to that APZC instance. + +Scroll Handoff +~~~~~~~~~~~~~~ + +Consider yet again the case where we have a scrollable page that +contains an iframe which itself is scrollable. Say the user scrolls the +iframe so that it reaches the bottom. If the user continues panning on +the iframe, the expectation is that the top-level page will start +scrolling. However, as discussed in the section on hit detection, the +APZC instance for the iframe is separate from the APZC instance for the +top-level page. Thus, we need the two APZC instances to communicate in +some way such that input events on the iframe result in scrolling on the +top-level page. This behaviour is referred to as “scroll handoff” (or +“fling handoff” in the case where analogous behaviour results from the +scrolling momentum of the page after the user has lifted their finger). + +Input event untransformation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The APZC architecture by definition results in two copies of a “scroll +position” for each scrollable layer. There is the original copy on the +main thread that is accessible to web content and the layout and +painting code. 
And there is a second copy on the compositor side, which +is updated asynchronously based on user input, and corresponds to what +the user visually sees on the screen. Although these two copies may +diverge temporarily, they are reconciled periodically. In particular, +they diverge while the APZ code is performing an async pan or zoom +action on behalf of the user, and are reconciled when the APZ code +requests a repaint from the main thread. + +Because of the way input events are stored, this has some unfortunate +consequences. Input events are stored relative to the device screen - so +if the user touches at the same physical spot on the device, the same +input events will be delivered regardless of the content scroll +position. When the main thread receives a touch event, it combines that +with the content scroll position in order to figure out what DOM element +the user touched. However, because we now have two different scroll +positions, this process may not work perfectly. A concrete example +follows: + +Consider a device with a screen that is 600 pixels tall. On this device, a +user is viewing a document that is 1000 pixels tall, and that is +scrolled down by 200 pixels. That is, the vertical section of the +document from 200px to 800px is visible. Now, if the user touches a +point 100px from the top of the physical display, the hardware will +generate a touch event with y=100. This will get sent to the main +thread, which will add the scroll position (200) and get a +document-relative touch event with y=300. This new y-value will be used +in hit detection to figure out what the user touched. If the document +had an absolutely-positioned div at y=300, then that would receive the +touch event. + +Now let us add some async scrolling to this example. Say that the user +additionally scrolls the document by another 10 pixels asynchronously +(i.e. only on the compositor thread), and then does the same touch +event.
The same input event is generated by the hardware, and as before, +the document will deliver the touch event to the div at y=300. However, +visually, the document is scrolled by an additional 10 pixels so this +outcome is wrong. The APZ code needs to +intercept the touch event and account for the 10 pixels of asynchronous +scroll. Therefore, the input event with y=100 gets converted to y=110 in +the APZ code before being passed on to the main thread. The main thread +then adds the scroll position it knows about and determines that the +user touched at a document-relative position of y=310. + +Analogous input event transformations need to be done for horizontal +scrolling and zooming. + +Content independently adjusting scrolling +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As described above, there are two copies of the scroll position in the +APZ architecture - one on the main thread and one on the compositor +thread. Usually for architectures like this, there is a single “source +of truth” value and the other value is simply a copy. However, in this +case that is not easily possible. The reason is that both of these +values can be legitimately modified. On the compositor side, the input +events the user is triggering modify the scroll position, which is then +propagated to the main thread. However, on the main thread, web content +might be running JavaScript code that programmatically sets the scroll +position (via window.scrollTo, for example). Scroll changes driven from +the main thread are just as legitimate and need to be propagated to the +compositor thread, so that the visual display updates in response. + +Because the cross-thread messaging is asynchronous, reconciling the two +types of scroll changes is a tricky problem. Our design solves this +using various flags and generation counters. The general heuristic we +have is that content-driven scroll position changes (e.g. scrollTo from +JS) are never lost.
For instance, if the user is doing an async scroll +with their finger and content does a scrollTo in the middle, then some +of the async scroll would occur before the “jump” and the rest after the +“jump”. + +Content preventing default behaviour of input events +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Another problem that we need to deal with is that web content is allowed +to intercept touch events and prevent the “default behaviour” of +scrolling. This ability is defined in web standards and is +non-negotiable. Touch event listeners in web content are allowed to call +preventDefault() on the touchstart or first touchmove event for a touch +point; doing this is supposed to “consume” the event and prevent +touch-based panning. As we saw in a previous section, the input event +needs to be untransformed by the APZ code before it can be delivered to +content. But, because of the preventDefault problem, we cannot fully +process the touch event in the APZ code until content has had a chance +to handle it. Web browsers in general solve this problem by inserting a +delay of up to 300ms before processing the input - that is, web content +is allowed up to 300ms to process the event and call preventDefault on +it. If web content takes longer than 300ms, or if it completes handling +of the event without calling preventDefault, then the browser +immediately starts processing the events. + +The way the APZ implementation deals with this is that upon receiving a +touch event, it immediately returns an untransformed version that can be +dispatched to content. It also schedules a 400ms timeout (600ms on +Android) during which content is allowed to prevent scrolling. There is +an API that allows the main-thread event dispatching code to notify the +APZ as to whether or not the default action should be prevented.
If the +APZ content response timeout expires, or if the main-thread event +dispatching code notifies the APZ of the preventDefault status, then the +APZ continues with the processing of the events (which may involve +discarding the events). + +The touch-action CSS property from the pointer-events spec is intended +to allow eliminating this 400ms delay in many cases (although for +backwards compatibility it will still be needed for a while). Note that +even with touch-action implemented, there may be cases where the APZ +code does not know the touch-action behaviour of the point the user +touched. In such cases, the APZ code will still wait up to 400ms for the +main thread to provide it with the touch-action behaviour information. + +Technical details +----------------- + +This section describes various pieces of the APZ code, and goes into +more specific detail on APIs and code than the previous sections. The +primary purpose of this section is to help people who plan on making +changes to the code, while also not going into so much detail that it +needs to be updated with every patch. + +Overall flow of input events +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This section describes how input events flow through the APZ code. + +1. Input events arrive from the hardware/widget code into the APZ via + APZCTreeManager::ReceiveInputEvent. The thread that invokes this is + called the input thread, and may or may not be the same as the Gecko + main thread. +2. Conceptually the first thing that the APZCTreeManager does is to + associate these events with “input blocks”. An input block is a set + of events that share certain properties, and generally are intended + to represent a single gesture. For example with touch events, all + events following a touchstart up to but not including the next + touchstart are in the same block. All of the events in a given block + will go to the same APZC instance and will either all be processed + or all be dropped. +3. 
Using the first event in the input block, the APZCTreeManager does a + hit-test to see which APZC it hits. This hit-test uses the event + regions populated on the layers, which may be larger than the true + hit area of the layer. If no APZC is hit, the events are discarded + and we jump to step 6. Otherwise, the input block is tagged with the + hit APZC as a tentative target and put into a global APZ input + queue. +4. + + i. If the input events landed outside the dispatch-to-content event + region for the layer, any available events in the input block + are processed. These may trigger behaviours like scrolling or + tap gestures. + ii. If the input events landed inside the dispatch-to-content event + region for the layer, the events are left in the queue and a + 400ms timeout is initiated. If the timeout expires before step 9 + is completed, the APZ assumes the input block was not cancelled + and the tentative target is correct, and processes them as part + of step 10. + +5. The call stack unwinds back to APZCTreeManager::ReceiveInputEvent, + which does an in-place modification of the input event so that any + async transforms are removed. +6. The call stack unwinds back to the widget code that called + ReceiveInputEvent. This code now has the event in the coordinate + space Gecko is expecting, and so can dispatch it to the Gecko main + thread. +7. Gecko performs its own usual hit-testing and event dispatching for + the event. As part of this, it records whether any touch listeners + cancelled the input block by calling preventDefault(). It also + activates inactive scrollframes that were hit by the input events. +8. The call stack unwinds back to the widget code, which sends two + notifications to the APZ code on the input thread. The first + notification is via APZCTreeManager::ContentReceivedInputBlock, and + informs the APZ whether the input block was cancelled. 
The second + notification is via APZCTreeManager::SetTargetAPZC, and informs the + APZ of the results of the Gecko hit-test during event dispatch. Note + that Gecko may report that the input event did not hit any + scrollable frame at all. The SetTargetAPZC notification happens only + once per input block, while the ContentReceivedInputBlock + notification may happen once per block, or multiple times per block, + depending on the input type. +9. + + i. If the events were processed as part of step 4(i), the + notifications from step 8 are ignored and step 10 is skipped. + ii. If events were queued as part of step 4(ii), and steps 5-8 take + less than 400ms, the arrival of both notifications from step 8 + will mark the input block ready for processing. + iii. If events were queued as part of step 4(ii), but steps 5-8 take + longer than 400ms, the notifications from step 8 will be + ignored and step 10 will already have happened. + +10. If events were queued as part of step 4(ii) they are now either + processed (if the input block was not cancelled and Gecko detected a + scrollframe under the input event, or if the timeout expired) or + dropped (all other cases). Note that the APZC that processes the + events may be different at this step than the tentative target from + step 3, depending on the SetTargetAPZC notification. Processing the + events may trigger behaviours like scrolling or tap gestures. + +If the CSS touch-action property is enabled, the above steps are +modified as follows: + +- In step 4, the APZC also requires the allowed touch-action behaviours + for the input event. This might have been determined as part of the + hit-test in APZCTreeManager; if not, the events are queued. +- In step 6, the widget code determines the content element at the + point under the input element, and notifies the APZ code of the + allowed touch-action behaviours. This notification is sent via a call + to APZCTreeManager::SetAllowedTouchBehavior on the input thread.
+- In step 9(ii), the input block will only be marked ready for + processing once all three notifications arrive. + +Threading considerations +^^^^^^^^^^^^^^^^^^^^^^^^ + +The bulk of the input processing in the APZ code happens on what we call +“the input thread”. In practice the input thread could be the Gecko main +thread, the compositor thread, or some other thread. There are obvious +downsides to using the Gecko main thread - that is, “asynchronous” +panning and zooming is not really asynchronous as input events can only +be processed while Gecko is idle. In an e10s environment, using the +Gecko main thread of the chrome process is acceptable, because the code +running in that process is more controllable and well-behaved than +arbitrary web content. Using the compositor thread as the input thread +could work on some platforms, but may be inefficient on others. For +example, on Android (Fennec) we receive input events from the system on +a dedicated UI thread. We would have to redispatch the input events to +the compositor thread if we wanted the input thread to be the same as +the compositor thread. This introduces a potential for higher latency, +particularly if the compositor does any blocking operations - blocking +SwapBuffers operations, for example. As a result, the APZ code itself +does not assume that the input thread will be the same as the Gecko main +thread or the compositor thread. + +Active vs. inactive scrollframes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The number of scrollframes on a page is potentially unbounded. However, +we do not want to create a separate layer for each scrollframe right +away, as this would require large amounts of memory. Therefore, +scrollframes are designated as either “active” or “inactive”. Active +scrollframes are the ones that do have their contents put on a separate +layer (or set of layers), and inactive ones do not. + +Consider a page with a scrollframe that is initially inactive.
When +layout generates the layers for this page, the content of the +scrollframe will be flattened into some other PaintedLayer (call it P). +The layout code also adds the area (or bounding region in case of weird +shapes) of the scrollframe to the dispatch-to-content region of P. + +When the user starts interacting with that content, the hit-test in the +APZ code finds the dispatch-to-content region of P. The input block +therefore has a tentative target of P when it goes into step 4(ii) in +the flow above. When Gecko processes the input event, it must detect the +inactive scrollframe and activate it, as part of step 7. Finally, the +widget code sends the SetTargetAPZC notification in step 8 to notify the +APZ that the input block should really apply to this new layer. The +issue here is that the layer transaction containing the new layer must +reach the compositor and APZ before the SetTargetAPZC notification. If +this does not occur within the 400ms timeout, the APZ code will be +unable to update the tentative target, and will continue to use P for +that input block. Input blocks that start after the layer transaction +will get correctly routed to the new layer as there will now be a layer +and APZC instance for the active scrollframe. + +This model implies that when the user initially attempts to scroll an +inactive scrollframe, it may end up scrolling an ancestor scrollframe. +(This is because in the absence of the SetTargetAPZC notification, the +input events will get applied to the closest ancestor scrollframe’s +APZC.) Only after the round-trip to the Gecko main thread is complete is +there a layer for async scrolling to actually occur on the scrollframe +itself. At that point the scrollframe will start receiving new input +blocks and will scroll normally.
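The tentative-target behaviour described in this section, together with steps 9 and 10 of the input flow, can be sketched as a small decision function. This is a simplified model and not the APZ implementation: ``InputBlock``, ``resolve`` and the field names are hypothetical, and only the 400ms figure comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional

CONTENT_RESPONSE_TIMEOUT_MS = 400  # timeout quoted in the text above
PENDING = "pending"


@dataclass
class InputBlock:
    tentative_target: str                  # APZC found by the APZ hit-test
    queued_at_ms: int
    cancelled: Optional[bool] = None       # set once content has responded
    target_reported: bool = False          # set once Gecko's hit-test result arrives
    reported_target: Optional[str] = None  # None: Gecko hit no scrollable frame


def resolve(block, now_ms):
    """Return the APZC name that processes the block, None if the block is
    dropped, or PENDING if the APZ should keep waiting.

    The sketch is stateless: a caller would stop asking after the first
    non-pending answer, which is how late notifications end up ignored.
    """
    if block.cancelled is not None and block.target_reported:
        if block.cancelled or block.reported_target is None:
            return None
        return block.reported_target
    if now_ms - block.queued_at_ms >= CONTENT_RESPONSE_TIMEOUT_MS:
        # No answer in time: assume not cancelled, keep the tentative target.
        return block.tentative_target
    return PENDING


waiting = InputBlock(tentative_target="P", queued_at_ms=0)
confirmed = InputBlock("P", 0, cancelled=False, target_reported=True,
                       reported_target="scrollframe")
prevented = InputBlock("P", 0, cancelled=True, target_reported=True,
                       reported_target="scrollframe")
```

With these inputs, ``waiting`` stays pending until the timeout and then falls back to ``P``, which corresponds to the ancestor-scrolling case described above, while ``confirmed`` is rerouted to the newly activated scrollframe.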
diff --git a/gfx/doc/AsyncPanZoom-HighLevel.png b/gfx/docs/AsyncPanZoomArchitecture.png similarity index 100% rename from gfx/doc/AsyncPanZoom-HighLevel.png rename to gfx/docs/AsyncPanZoomArchitecture.png diff --git a/gfx/docs/GraphicsOverview.rst b/gfx/docs/GraphicsOverview.rst new file mode 100644 index 000000000000..847762257507 --- /dev/null +++ b/gfx/docs/GraphicsOverview.rst @@ -0,0 +1,159 @@ +Graphics Overview +========================= + +Work in progress. Possibly incorrect or incomplete. +--------------------------------------------------- + +Jargon +------ + +There's a lot of jargon in the graphics stack. We try to maintain a list +of common words and acronyms `here `__. + +Overview +-------- + +The graphics system is responsible for rendering (painting, drawing) +the frame tree (rendering tree) elements as created by the layout +system. Each leaf in the tree has content, bounded by a rectangle +(or perhaps another shape, in the case of SVG). + +The simple approach for producing the result would thus involve +traversing the frame tree, in a correct order, drawing each frame into +the resulting buffer and displaying (printing notwithstanding) that +buffer when the traversal is done. It is worth spending some time on the +“correct order” note above. If there are no overlapping frames, this is +fairly simple - any order will do, as long as there is no background. If +there is a background, we just have to worry about drawing that first. +Since we do not control the content, chances are the page is more +complicated. There are overlapping frames, likely with transparency, so +we need to make sure the elements are drawn “back to front”, in layers, +so to speak. Layers are an important concept, and we will revisit them +shortly, as they are central to fixing a major issue with the above +simple approach. + +While the above simple approach will work, the performance will suffer.
+Each time anything changes in any of the frames, the complete process +needs to be repeated, everything needs to be redrawn. Further, there is +very little space to take advantage of the modern graphics (GPU) +hardware, or multi-core computers. If you recall from the previous +sections, the frame tree is only accessible from the UI thread, so while +we’re doing all this work, the UI is basically blocked. + +(Retained) Layers +~~~~~~~~~~~~~~~~~ + +The Layers framework was introduced to address the above performance issues, +by having a part of the design address each item. At the high level: + +1. We create a layer tree. The leaf elements of the tree contain all + frames (possibly multiple frames per leaf). +2. We render each layer tree element and cache (retain) the result. +3. We composite (combine) all the leaf elements into the final result. + +Let’s examine each of these steps, in reverse order. + +Compositing +~~~~~~~~~~~ + +We use the term composite as it implies that the order is important. If +the elements being composited overlap, whether there is transparency +involved or not, the order in which they are combined will affect the +result. Compositing is where we can use some of the power of the modern +graphics hardware. It is optimal for doing this job. In the scenarios +where only the position of individual frames changes, without the +content inside them changing, we see why caching each layer would be +advantageous - we only need to repeat the final compositing step, +completely skipping the layer tree creation and the rendering of each +leaf, thus speeding up the process considerably. + +Another benefit is equally apparent in the context of the stated +deficiencies of the simple approach. We can use the available graphics +hardware accelerated APIs to do the compositing step. Direct3D and OpenGL +can be used on different platforms and are well suited to accelerate +this step.
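To illustrate the point made at the start of this section, that the order of compositing changes the result, here is a minimal sketch using the standard “source over” blend on non-premultiplied colours. The helper names (``over``, ``composite``) and the colour values are assumptions chosen for the example; this is not how the Gecko compositor is written.

```python
def over(src, dst):
    """Blend a translucent src (r, g, b, a) over an opaque dst (r, g, b)."""
    r, g, b, a = src
    return tuple(a * s + (1 - a) * d for s, d in zip((r, g, b), dst))


def composite(layers, background):
    # Layers are listed back to front; each one is blended over the
    # accumulated result of everything behind it.
    result = background
    for layer in layers:
        result = over(layer, result)
    return result


white = (1.0, 1.0, 1.0)
red = (1.0, 0.0, 0.0, 0.5)    # 50% opaque red layer
blue = (0.0, 0.0, 1.0, 0.5)   # 50% opaque blue layer

red_then_blue = composite([red, blue], white)   # blue ends up on top
blue_then_red = composite([blue, red], white)   # red ends up on top
```

The two results differ (the layer composited last dominates the colour), while the cached ``red`` and ``blue`` layers themselves are untouched, which is why only the compositing step has to be repeated when layers merely move or reorder.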
+ +Finally, we can now envision performing the compositing step on a +separate thread, unblocking the UI thread for other work, and doing more +work in parallel. More on this below. + +It is important to note that the number of operations in this step is +proportional to the number of layer tree (leaf) elements, so there is +additional work and complexity involved, when the layer tree is large. + +Render and retain layer elements +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As we saw, the compositing step benefits from caching the intermediate +result. This does result in extra memory usage, so it needs to be +considered during the layer tree creation. Beyond the caching, we can +accelerate the rendering of each element by (indirectly) using the +available platform APIs (e.g., Direct2D, CoreGraphics, even some of the +3D APIs like OpenGL or Direct3D) as available. This is actually done +through a platform independent API (see Moz2D below), but it is important +to realize it does get accelerated appropriately. + +Creating the layer tree +~~~~~~~~~~~~~~~~~~~~~~~ + +We need to create a layer tree (from the frames tree), which will give +us the correct result while striking the right balance between a layer +per frame element and a single layer for the complete frames tree. As +was mentioned above, there is an overhead in traversing the whole tree +and caching each of the elements, balanced by the performance +improvements. Some of the performance improvements are only noticed when +something changes (e.g., one element is moving, we only need to redo the +compositing step). + +Refresh Driver +~~~~~~~~~~~~~~ + +Layers +~~~~~~ + +Rendering each layer +~~~~~~~~~~~~~~~~~~~~ + +Tiling vs. Buffer Rotation vs.
Full paint +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Compositing for the final result +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Graphics API +~~~~~~~~~~~~ + +Moz2D +~~~~~ + +- The Moz2D graphics API, part of the Azure project, is a + cross-platform interface onto the various graphics backends that + Gecko uses for rendering such as Direct2D (1.0 and 1.1), Skia, Cairo, + Quartz, and NV Path. Adding a new graphics platform to Gecko is + accomplished by adding a backend to Moz2D. + See `Moz2D documentation on wiki `__. + +Compositing +~~~~~~~~~~~ + +Image Decoding +~~~~~~~~~~~~~~ + +Image Animation +~~~~~~~~~~~~~~~ + +`Historical Documents `__ +--------------------------------------------------------------------- + +A number of posts and blogs that will give you more details or more +background, or reasoning that led to different solutions and approaches. + +- 2010-01 `Layers: Cross Platform Acceleration `__ +- 2010-04 `Layers `__ +- 2010-07 `Retained Layers `__ +- 2011-04 `Introduction `__ +- 2011-07 `Layers `__ +- 2011-09 `Graphics API Design `__ +- 2012-04 `Moz2D Canvas on OSX `__ +- 2012-05 `Mask Layers `__ +- 2013-07 `Graphics related `__ diff --git a/gfx/docs/LayersHistory.rst b/gfx/docs/LayersHistory.rst new file mode 100644 index 000000000000..360df9b37dc4 --- /dev/null +++ b/gfx/docs/LayersHistory.rst @@ -0,0 +1,63 @@ +Layers History +============== + +This is an overview of the major events in the history of our Layers +infrastructure. + +- iPhone released in July 2007 (Built on a toolkit called LayerKit) + +- Core Animation (October 2007) LayerKit was publicly renamed to OS X + 10.5 + +- Webkit CSS 3d transforms (July 2009) + +- Original layers API (March 2010) Introduced the idea of a layer + manager that would composite. One of the first use cases for this was + hardware accelerated YUV conversion for video. 
+ +- Retained layers (July 7 2010 - Bug 564991) This was an important + concept that introduced the idea of persisting the layer content + across paints in gecko controlled buffers instead of just by the OS. + This introduced the concept of buffer rotation to deal with scrolling + instead of using the native scrolling APIs like ScrollWindowEx + +- Layers IPC (July 2010 - Bug 570294) This introduced shadow layers and + edit lists and was originally done for e10s v1 + +- 3D transforms (September 2011 - Bug 505115) + +- OMTC (December 2011 - Bug 711168) This was prototyped on OS X but + shipped first for Fennec + +- Tiling v1 (April 2012 - Bug 739679) Originally done for Fennec. This + was done to avoid situations where we had to do a bunch of work for + scrolling a small amount. i.e. buffer rotation. It allowed us to have + a variety of interesting features like progressive painting and lower + resolution painting. + +- C++ Async pan zoom controller (July 2012 - Bug 750974) The existing + APZ code was in Java for Fennec so this was reimplemented. + +- Streaming WebGL Buffers (February 2013 - Bug 716859) Infrastructure + to allow OMTC WebGL and avoid the need to glFinish() every frame. + +- Compositor API (April 2013 - Bug 825928) The planning for this + started around November 2012. Layers refactoring created a compositor + API that abstracted away the differences between the D3D vs OpenGL. + The main piece of API is DrawQuad. + +- Tiling v2 (Mar 7 2014 - Bug 963073) Tiling for B2G. This work is + mainly porting tiled layers to new textures, implementing + double-buffered tiles and implementing a texture client pool, to be + used by tiled content clients. + + A large motivation for the pool was the very slow performance of + allocating tiles because of the sync messages to the compositor. + + The slow performance of allocating was directly addressed by bug 959089 + which allowed us to allocate gralloc buffers without sync messages to + the compositor thread. 
+ +- B2G WebGL performance (May 2014 - Bug 1006957, 1001417, 1024144) This + work improved the synchronization mechanism between the compositor + and the producer. diff --git a/gfx/docs/Silk.rst b/gfx/docs/Silk.rst new file mode 100644 index 000000000000..01b14a35ca23 --- /dev/null +++ b/gfx/docs/Silk.rst @@ -0,0 +1,469 @@ +Silk Overview +========================== + +.. image:: SilkArchitecture.png + +Architecture +------------ + +Our current architecture is to align three components to hardware vsync +timers: + +1. Compositor +2. RefreshDriver / Painting +3. Input Events + +The flow of our rendering engine is as follows: + +1. Hardware Vsync event occurs on an OS-specific *Hardware Vsync Thread* + on a per-monitor basis. +2. The *Hardware Vsync Thread* attached to the monitor notifies the + ``CompositorVsyncDispatchers`` and ``RefreshTimerVsyncDispatcher``. +3. For every Firefox window on the specific monitor, notify a + ``CompositorVsyncDispatcher``. The ``CompositorVsyncDispatcher`` is + specific to one window. +4. The ``CompositorVsyncDispatcher`` notifies a + ``CompositorWidgetVsyncObserver`` when remote compositing, or a + ``CompositorVsyncScheduler::Observer`` when compositing in-process. +5. If remote compositing, a vsync notification is sent from the + ``CompositorWidgetVsyncObserver`` to the ``VsyncBridgeChild`` on the + UI process, which sends an IPDL message to the ``VsyncBridgeParent`` + on the compositor thread of the GPU process, which then dispatches to + ``CompositorVsyncScheduler::Observer``. +6. The ``RefreshTimerVsyncDispatcher`` notifies the Chrome + ``RefreshTimer`` that a vsync has occurred. +7. The ``RefreshTimerVsyncDispatcher`` sends IPC messages to all content + processes to tick their respective active ``RefreshTimer``. +8. The ``Compositor`` dispatches input events on the *Compositor + Thread*, then composites. Input events are only dispatched on the + *Compositor Thread* on b2g. +9. The ``RefreshDriver`` paints on the *Main Thread*.
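The fan-out in steps 1 to 3 and 6 of the list above can be sketched as follows. The Python classes reuse the dispatcher names from the list purely as labels; ``Monitor``, ``notify_vsync`` and ``on_hardware_vsync`` are hypothetical names for this illustration and are not the Gecko C++ interfaces.

```python
class CompositorVsyncDispatcher:
    """Stand-in for the per-window dispatcher named in the list."""

    def __init__(self, window_name):
        self.window_name = window_name
        self.received = []

    def notify_vsync(self, timestamp):
        self.received.append(timestamp)


class RefreshTimerVsyncDispatcher:
    """Stand-in for the dispatcher that ticks the refresh timers."""

    def __init__(self):
        self.ticks = []

    def notify_vsync(self, timestamp):
        self.ticks.append(timestamp)


class Monitor:
    """Hypothetical per-monitor vsync source."""

    def __init__(self):
        self.compositor_dispatchers = []
        self.refresh_timer_dispatcher = RefreshTimerVsyncDispatcher()

    def on_hardware_vsync(self, timestamp):
        # One notification per window on this monitor, then the refresh timer.
        for dispatcher in self.compositor_dispatchers:
            dispatcher.notify_vsync(timestamp)
        self.refresh_timer_dispatcher.notify_vsync(timestamp)


monitor = Monitor()
window_a = CompositorVsyncDispatcher("window A")
window_b = CompositorVsyncDispatcher("window B")
monitor.compositor_dispatchers += [window_a, window_b]
monitor.on_hardware_vsync(16.7)
```

Every consumer sees the same timestamp for a given vsync, which is the property that lets compositing and painting stay aligned to the same hardware signal.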
+ +Hardware Vsync +-------------- + +Hardware vsync events from (1) occur on a specific ``Display`` object. +The ``Display`` object is responsible for enabling / disabling vsync on +a per-connected-display basis. For example, if two monitors are +connected, two ``Display`` objects will be created, each listening to +vsync events for their respective displays. We require one ``Display`` +object per monitor as each monitor may have different vsync rates. As a +fallback solution, we have one global ``Display`` object that can +synchronize across all connected displays. The global ``Display`` is +useful if a window is positioned halfway between the two monitors. Each +platform will have to implement a specific ``Display`` object to hook +and listen to vsync events. As of this writing, both Firefox OS and OS X +create their own hardware-specific *Hardware Vsync Thread* that executes +after a vsync has occurred. OS X creates one *Hardware Vsync Thread* per +``CVDisplayLinkRef``. We do not currently support multiple displays, so +we use one global ``CVDisplayLinkRef`` that works across all active +displays. On Windows, we have to create a new platform ``thread`` that +waits for DwmFlush(), which works across all active displays. Once the +thread wakes up from DwmFlush(), the actual vsync timestamp is retrieved +from DwmGetCompositionTimingInfo(), which is the timestamp that is +actually passed into the compositor and refresh driver. + +When a vsync occurs on a ``Display``, the *Hardware Vsync Thread* +callback fetches all ``CompositorVsyncDispatchers`` associated with the +``Display``. Each ``CompositorVsyncDispatcher`` is notified that a vsync +has occurred with the vsync’s timestamp. It is the responsibility of the +``CompositorVsyncDispatcher`` to notify the ``Compositor`` that is +awaiting vsync notifications. The ``Display`` will then notify the +associated ``RefreshTimerVsyncDispatcher``, which should notify all +active ``RefreshDrivers`` to tick.
+ +All ``Display`` objects are encapsulated in a ``VsyncSource`` object. +The ``VsyncSource`` object lives in ``gfxPlatform`` and is instantiated +only on the parent process when ``gfxPlatform`` is created. The +``VsyncSource`` is destroyed when ``gfxPlatform`` is destroyed. There is +only one ``VsyncSource`` object throughout the entire lifetime of +Firefox. Each platform is expected to implement their own +``VsyncSource`` to manage vsync events. On Firefox OS, this is through +the ``HwcComposer2D``. On OS X, this is through ``CVDisplayLinkRef``. On +Windows, it should be through ``DwmGetCompositionTimingInfo``. + +Compositor +---------- + +When the ``CompositorVsyncDispatcher`` is notified of the vsync event, +the ``CompositorVsyncScheduler::Observer`` associated with the +``CompositorVsyncDispatcher`` begins execution. Since the +``CompositorVsyncDispatcher`` executes on the *Hardware Vsync Thread* +and the ``Compositor`` composites on the ``CompositorThread``, the +``CompositorVsyncScheduler::Observer`` posts a task to the +``CompositorThread``. The ``CompositorBridgeParent`` then composites. +The model where the ``CompositorVsyncDispatcher`` notifies components on +the *Hardware Vsync Thread*, and the component schedules the task on the +appropriate thread is used everywhere. + +The ``CompositorVsyncScheduler::Observer`` listens to vsync events as +needed and stops listening to vsync when composites are no longer +scheduled or required. Every ``CompositorBridgeParent`` is associated +and tied to one ``CompositorVsyncScheduler::Observer``, which is +associated with the ``CompositorVsyncDispatcher``. Each +``CompositorBridgeParent`` is associated with one widget and is created +when a new platform window or ``nsBaseWidget`` is created. The +``CompositorBridgeParent``, ``CompositorVsyncDispatcher``, +``CompositorVsyncScheduler::Observer``, and ``nsBaseWidget`` all have +the same lifetimes, which are created and destroyed together. 
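
The "notify on the vsync thread, do the work on the owning thread" model can
be sketched as follows. This is a minimal illustration under stated
assumptions: ``TaskQueueSketch`` stands in for the compositor thread's
message loop, ``CompositorObserverSketch`` for the observer, and none of
these names or signatures come from Gecko. The queue is drained manually so
the example stays deterministic.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical task queue standing in for the compositor thread's loop.
class TaskQueueSketch {
 public:
  // Safe to call from any thread, for example the hardware vsync thread.
  void Post(std::function<void()> aTask) {
    std::lock_guard<std::mutex> lock(mMutex);
    mTasks.push_back(std::move(aTask));
  }

  // Runs every task queued so far on the calling thread and returns the
  // number of tasks that ran.
  std::size_t Drain() {
    std::deque<std::function<void()>> pending;
    {
      std::lock_guard<std::mutex> lock(mMutex);
      pending.swap(mTasks);
    }
    for (auto& task : pending) {
      task();
    }
    return pending.size();
  }

 private:
  std::mutex mMutex;
  std::deque<std::function<void()>> mTasks;
};

// Hypothetical observer: it never composites inside NotifyVsync, it only
// schedules the composite on the compositor queue.
class CompositorObserverSketch {
 public:
  explicit CompositorObserverSketch(TaskQueueSketch& aCompositorQueue)
      : mCompositorQueue(aCompositorQueue) {}

  void NotifyVsync(std::int64_t aTimestamp) {
    mCompositorQueue.Post([this, aTimestamp] { Composite(aTimestamp); });
  }

  int CompositeCount() const { return mCompositeCount; }
  std::int64_t LastTimestamp() const { return mLastTimestamp; }

 private:
  // Only ever runs on the thread that drains the compositor queue.
  void Composite(std::int64_t aTimestamp) {
    ++mCompositeCount;
    mLastTimestamp = aTimestamp;
  }

  TaskQueueSketch& mCompositorQueue;
  int mCompositeCount = 0;
  std::int64_t mLastTimestamp = -1;
};
```

Keeping the vsync callback this small means the *Hardware Vsync Thread* is
never blocked by compositing work; the cost is one queue hop per frame.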
+ +Out-of-process Compositors +-------------------------- + +When compositing out-of-process, this model changes slightly. In this +case there are effectively two observers: a UI process observer +(``CompositorWidgetVsyncObserver``), and the +``CompositorVsyncScheduler::Observer`` in the GPU process. There are +also two dispatchers: the widget dispatcher in the UI process +(``CompositorVsyncDispatcher``), and the IPDL-based dispatcher in the +GPU process (``CompositorBridgeParent::NotifyVsync``). The UI process +observer and the GPU process dispatcher are linked via an IPDL protocol +called PVsyncBridge. ``PVsyncBridge`` is a top-level protocol for +sending vsync notifications to the compositor thread in the GPU process. +The compositor controls vsync observation through a separate actor, +``PCompositorWidget``, which (as a subactor for +``CompositorBridgeChild``) links the compositor thread in the GPU +process to the main thread in the UI process. + +Out-of-process compositors do not go through +``CompositorVsyncDispatcher`` directly. Instead, the +``CompositorWidgetDelegate`` in the UI process creates one, and gives it +a ``CompositorWidgetVsyncObserver``. This observer forwards +notifications to a Vsync I/O thread, where ``VsyncBridgeChild`` then +forwards the notification again to the compositor thread in the GPU +process. The notification is received by a ``VsyncBridgeParent``. The +GPU process uses the layers ID in the notification to find the correct +compositor to dispatch the notification to. + +CompositorVsyncDispatcher +------------------------- + +The ``CompositorVsyncDispatcher`` executes on the *Hardware Vsync +Thread*. It contains references to the ``nsBaseWidget`` it is associated +with and has a lifetime equal to the ``nsBaseWidget``. The +``CompositorVsyncDispatcher`` is responsible for notifying the +``CompositorBridgeParent`` that a vsync event has occurred.
There can be +multiple ``CompositorVsyncDispatchers`` per ``Display``, one +``CompositorVsyncDispatcher`` per window. The only responsibility of the +``CompositorVsyncDispatcher`` is to notify components when a vsync event +has occurred, and to stop listening to vsync when no components require +vsync events. We require one ``CompositorVsyncDispatcher`` per window so +that we can handle multiple ``Displays``. When compositing in-process, +the ``CompositorVsyncDispatcher`` is attached to the CompositorWidget +for the window. When out-of-process, it is attached to the +CompositorWidgetDelegate, which forwards observer notifications over +IPDL. In the latter case, its lifetime is tied to a CompositorSession +rather than the nsIWidget. + +Multiple Displays +----------------- + +The ``VsyncSource`` has an API to switch a ``CompositorVsyncDispatcher`` +from one ``Display`` to another ``Display``. For example, when one +window either goes into full screen mode or moves from one connected +monitor to another. When one window moves to another monitor, we expect +a platform specific notification to occur. The detection of when a +window enters full screen mode or moves is not covered by Silk itself, +but the framework is built to support this use case. The expected flow +is that the OS notification occurs on ``nsIWidget``, which retrieves the +associated ``CompositorVsyncDispatcher``. The +``CompositorVsyncDispatcher`` then notifies the ``VsyncSource`` to +switch to the correct ``Display`` the ``CompositorVsyncDispatcher`` is +connected to. Because the notification works through the ``nsIWidget``, +the actual switching of the ``CompositorVsyncDispatcher`` to the correct +``Display`` should occur on the *Main Thread*. The current +implementation of Silk does not handle this case and needs to be built +out.
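
Since this flow is not yet implemented, the following is only a sketch of
what the expected switch could look like. ``VsyncSourceSketch``,
``MoveToDisplay`` and the integer display IDs are hypothetical names chosen
for the example and do not correspond to the real ``VsyncSource`` API.

```cpp
#include <cstdint>
#include <map>
#include <set>

// Hypothetical per-window dispatcher that records what it was told.
struct WindowDispatcherSketch {
  std::int64_t lastTimestamp = -1;
  int notifyCount = 0;
};

// Hypothetical source that tracks which dispatchers listen to which display.
class VsyncSourceSketch {
 public:
  // Attach the dispatcher to aDisplayId, detaching it from any display it
  // was on before. In the flow above this would run on the main thread.
  void MoveToDisplay(WindowDispatcherSketch* aDispatcher, int aDisplayId) {
    for (auto& entry : mDispatchersByDisplay) {
      entry.second.erase(aDispatcher);
    }
    mDispatchersByDisplay[aDisplayId].insert(aDispatcher);
  }

  // Deliver a vsync for one display only to the windows currently on it.
  void OnVsync(int aDisplayId, std::int64_t aTimestamp) {
    auto it = mDispatchersByDisplay.find(aDisplayId);
    if (it == mDispatchersByDisplay.end()) {
      return;
    }
    for (WindowDispatcherSketch* dispatcher : it->second) {
      dispatcher->lastTimestamp = aTimestamp;
      ++dispatcher->notifyCount;
    }
  }

 private:
  std::map<int, std::set<WindowDispatcherSketch*>> mDispatchersByDisplay;
};
```

After a move, the window stops receiving vsyncs from its old monitor and
starts receiving them from the new one, which is the behavior the text
expects once the OS notification reaches the ``nsIWidget``.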
+ +CompositorVsyncScheduler::Observer +---------------------------------- + +The ``CompositorVsyncScheduler::Observer`` handles the vsync +notifications and interactions with the ``CompositorVsyncDispatcher``. +When the ``Compositor`` requires a scheduled composite, it notifies the +``CompositorVsyncScheduler::Observer`` that it needs to listen to vsync. +The ``CompositorVsyncScheduler::Observer`` then observes / unobserves +vsync as needed from the ``CompositorVsyncDispatcher`` to enable +composites. + +GeckoTouchDispatcher +-------------------- + +The ``GeckoTouchDispatcher`` is a singleton that resamples touch events +to smooth out jank while tracking a user’s finger. Because input and +composite are linked together, the +``CompositorVsyncScheduler::Observer`` has a reference to the +``GeckoTouchDispatcher`` and vice versa. + +Input Events +------------ + +One large goal of Silk is to align touch events with vsync events. On +Firefox OS, touchscreens often have different touch scan rates than the +display refreshes. A Flame device has a touch refresh rate of 75 Hz, +while a Nexus 4 has a touch refresh rate of 100 Hz, while the device’s +display refresh rate is 60 Hz. When a vsync event occurs, we resample +touch events, and then dispatch the resampled touch event to APZ. Touch +events on Firefox OS occur on a *Touch Input Thread* whereas they are +processed by APZ on the *APZ Controller Thread*. We use `Google +Android’s touch +resampling `__ +algorithm to resample touch events. + +Currently, we have a strict ordering between Composites and touch +events. When a touch event occurs on the *Touch Input Thread*, we store +the touch event in a queue. When a vsync event occurs, the +``CompositorVsyncDispatcher`` notifies the ``Compositor`` of a vsync +event, which notifies the ``GeckoTouchDispatcher``.
The +``GeckoTouchDispatcher`` processes the touch event first on the *APZ +Controller Thread*, which is the same as the *Compositor Thread* on b2g, +then the ``Compositor`` finishes compositing. We require this strict +ordering because if a vsync notification is dispatched to both the +``Compositor`` and ``GeckoTouchDispatcher`` at the same time, a race +condition occurs between processing the touch event and therefore +position versus compositing. In practice, this creates very janky +scrolling. As of this writing, we have not analyzed input events on +desktop platforms. + +One slight quirk is that input events can start a composite, for example +during a scroll and after the ``Compositor`` is no longer listening to +vsync events. In these cases, we notify the ``Compositor`` to observe +vsync so that it dispatches touch events. If touch events were not +dispatched, and since the ``Compositor`` is not listening to vsync +events, the touch events would never be dispatched. The +``GeckoTouchDispatcher`` handles this case by always forcing the +``Compositor`` to listen to vsync events while touch events are +occurring. + +Widget, Compositor, CompositorVsyncDispatcher, GeckoTouchDispatcher Shutdown Procedure +-------------------------------------------------------------------------------------- + +When the `nsBaseWidget shuts +down `__ +- It calls nsBaseWidget::DestroyCompositor on the *Gecko Main Thread*. +During nsBaseWidget::DestroyCompositor, it first destroys the +CompositorBridgeChild. CompositorBridgeChild sends a sync IPC call to +CompositorBridgeParent::RecvStop, which calls +`CompositorBridgeParent::Destroy `__. +During this time, the *main thread* is blocked on the parent process. +CompositorBridgeParent::RecvStop runs on the *Compositor thread* and +cleans up some resources, including setting the +``CompositorVsyncScheduler::Observer`` to nullptr. 
+CompositorBridgeParent::RecvStop also explicitly keeps the +CompositorBridgeParent alive and posts another task to run +CompositorBridgeParent::DeferredDestroy on the Compositor loop so that +all ipdl code can finish executing. The +``CompositorVsyncScheduler::Observer`` also unobserves from vsync and +cancels any pending composite tasks. Once +CompositorBridgeParent::RecvStop finishes, the *main thread* in the +parent process continues shutting down the nsBaseWidget. + +At the same time, the *Compositor thread* is executing tasks until +CompositorBridgeParent::DeferredDestroy runs, which flushes the +compositor message loop. Now we have two tasks as both the nsBaseWidget +releases a reference to the Compositor on the *main thread* during +destruction and the CompositorBridgeParent::DeferredDestroy releases a +reference to the CompositorBridgeParent on the *Compositor Thread*. +Finally, the CompositorBridgeParent itself is destroyed on the *main +thread* once both references are gone due to explicit `main thread +destruction `__. + +With the ``CompositorVsyncScheduler::Observer``, any accesses to the +widget after nsBaseWidget::DestroyCompositor executes are invalid. Any +accesses to the compositor between the time the +nsBaseWidget::DestroyCompositor runs and the +CompositorVsyncScheduler::Observer’s destructor runs aren’t safe yet a +hardware vsync event could occur between these times. Since any tasks +posted on the Compositor loop after +CompositorBridgeParent::DeferredDestroy is posted are invalid, we make +sure that no vsync tasks can be posted once +CompositorBridgeParent::RecvStop executes and DeferredDestroy is posted +on the Compositor thread. When the sync call to +CompositorBridgeParent::RecvStop executes, we explicitly set the +CompositorVsyncScheduler::Observer to null to prevent vsync +notifications from occurring. 
If vsync notifications were allowed to +occur, since the ``CompositorVsyncScheduler::Observer``\ ’s vsync +notification executes on the *hardware vsync thread*, it would post a +task to the Compositor loop and may execute after +CompositorBridgeParent::DeferredDestroy. Thus, we explicitly shut down +vsync events in the ``CompositorVsyncDispatcher`` and +``CompositorVsyncScheduler::Observer`` during nsBaseWidget::Shutdown to +prevent any vsync tasks from executing after +CompositorBridgeParent::DeferredDestroy. + +The ``CompositorVsyncDispatcher`` may be destroyed on either the *main +thread* or *Compositor Thread*, since both the nsBaseWidget and +``CompositorVsyncScheduler::Observer`` race to destroy on different +threads. nsBaseWidget is destroyed on the *main thread* and releases a +reference to the ``CompositorVsyncDispatcher`` during destruction. The +``CompositorVsyncScheduler::Observer`` has a race to be destroyed either +during CompositorBridgeParent shutdown or from the +``GeckoTouchDispatcher`` which is destroyed on the main thread with +`ClearOnShutdown `__. +Whichever object, the CompositorBridgeParent or the +``GeckoTouchDispatcher`` is destroyed last will hold the last reference +to the ``CompositorVsyncDispatcher``, which destroys the object. + +Refresh Driver +-------------- + +The Refresh Driver is ticked from a `single active +timer `__. +The assumption is that there are multiple ``RefreshDrivers`` connected +to a single ``RefreshTimer``. There are two ``RefreshTimers``: an active +and an inactive ``RefreshTimer``. Each Tab has its own +``RefreshDriver``, which connects to one of the global +``RefreshTimers``. The ``RefreshTimers`` execute on the *Main Thread* +and tick their connected ``RefreshDrivers``. We do not want to break +this model of multiple ``RefreshDrivers`` per a set of two global +``RefreshTimers``. Each ``RefreshDriver`` switches between the active +and inactive ``RefreshTimer``. 
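
The relationship between drivers and the two global timers can be sketched
as below. The class and method names are hypothetical and the sketch leaves
out everything a real ``RefreshDriver`` does when it ticks; it only shows
many drivers sharing two timers and a driver moving between them.

```cpp
#include <algorithm>
#include <vector>

class RefreshDriverSketch;

// Hypothetical timer that ticks every driver currently attached to it.
class RefreshTimerSketch {
 public:
  void AddDriver(RefreshDriverSketch* aDriver) { mDrivers.push_back(aDriver); }

  void RemoveDriver(RefreshDriverSketch* aDriver) {
    mDrivers.erase(std::remove(mDrivers.begin(), mDrivers.end(), aDriver),
                   mDrivers.end());
  }

  void Tick();

 private:
  std::vector<RefreshDriverSketch*> mDrivers;
};

// Hypothetical per-tab driver that is always attached to exactly one timer.
class RefreshDriverSketch {
 public:
  RefreshDriverSketch(RefreshTimerSketch& aActive,
                      RefreshTimerSketch& aInactive)
      : mActive(aActive), mInactive(aInactive), mCurrent(&aActive) {
    mCurrent->AddDriver(this);
  }
  RefreshDriverSketch(const RefreshDriverSketch&) = delete;
  RefreshDriverSketch& operator=(const RefreshDriverSketch&) = delete;
  ~RefreshDriverSketch() { mCurrent->RemoveDriver(this); }

  // Switch between the active and the inactive global timer.
  void SetActive(bool aIsActive) {
    RefreshTimerSketch* target = aIsActive ? &mActive : &mInactive;
    if (target == mCurrent) {
      return;
    }
    mCurrent->RemoveDriver(this);
    target->AddDriver(this);
    mCurrent = target;
  }

  void Tick() { ++mTickCount; }
  int TickCount() const { return mTickCount; }

 private:
  RefreshTimerSketch& mActive;
  RefreshTimerSketch& mInactive;
  RefreshTimerSketch* mCurrent;
  int mTickCount = 0;
};

inline void RefreshTimerSketch::Tick() {
  for (RefreshDriverSketch* driver : mDrivers) {
    driver->Tick();
  }
}
```

Because drivers only hold a pointer to their current timer, replacing what
drives the active timer does not require any change to the drivers
themselves.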
+ +Instead, we create a new ``RefreshTimer``, the ``VsyncRefreshTimer`` +which ticks based on vsync messages. We replace the current active timer +with a ``VsyncRefreshTimer``. All tabs will then tick based on this new +active timer. Since the ``RefreshTimer`` has a lifetime of the process, +we only need to create a single ``RefreshTimerVsyncDispatcher`` per +``Display`` when Firefox starts. Even if we do not have any content +processes, the Chrome process will still need a ``VsyncRefreshTimer``, +thus we can associate the ``RefreshTimerVsyncDispatcher`` with each +``Display``. + +When Firefox starts, we initially create a new ``VsyncRefreshTimer`` in +the Chrome process. The ``VsyncRefreshTimer`` will listen to vsync +notifications from ``RefreshTimerVsyncDispatcher`` on the global +``Display``. When nsRefreshDriver::Shutdown executes, it will delete the +``VsyncRefreshTimer``. This creates a problem as all the +``RefreshTimers`` are currently manually memory managed whereas +``VsyncObservers`` are ref counted. To work around this problem, we +create a new ``RefreshDriverVsyncObserver`` as an inner class to +``VsyncRefreshTimer``, which actually receives vsync notifications. It +then ticks the ``RefreshDrivers`` inside ``VsyncRefreshTimer``. + +With Content processes, the start up process is more complicated. We +send vsync IPC messages via the use of the PBackground thread on the +parent process, which allows us to send messages from the Parent +process without waiting on the *main thread*. This sends messages from +the Parent::\ *PBackground Thread* to the Child::\ *Main Thread*. The +*main thread* receiving IPC messages on the content process is +acceptable because ``RefreshDrivers`` must execute on the *main thread*. +However, there is some amount of time required to set up the IPC +connection upon process creation and during this time, the +``RefreshDrivers`` must tick to set up the process.
To get around this, +we initially use software ``RefreshTimers`` that already exist during +content process startup and swap in the ``VsyncRefreshTimer`` once the +IPC connection is created. + +During nsRefreshDriver::ChooseTimer, we create an async PBackground IPC +open request to create a ``VsyncParent`` and ``VsyncChild``. At the same +time, we create a software ``RefreshTimer`` and tick the +``RefreshDrivers`` as normal. Once the PBackground callback is executed +and an IPC connection exists, we take all ``RefreshDrivers`` currently +associated with the active ``RefreshTimer`` and swap the +``RefreshDrivers`` to use the ``VsyncRefreshTimer``. Since all +interactions on the content process occur on the main thread, there is +no need for locks. The ``VsyncParent`` listens to vsync events through +the ``RefreshTimerVsyncDispatcher`` on the parent side and sends vsync +IPC messages to the ``VsyncChild``. The ``VsyncChild`` notifies the +``VsyncRefreshTimer`` on the content process. + +During the shutdown process of the content process, ActorDestroy is +called on the ``VsyncChild`` and ``VsyncParent`` due to the normal +PBackground shutdown process. Once ActorDestroy is called, no IPC +messages should be sent across the channel. After ActorDestroy is +called, the IPDL machinery will delete the **VsyncParent/Child** pair. +The ``VsyncParent``, due to being a ``VsyncObserver``, is ref counted. +After ``VsyncParent::ActorDestroy`` is called, it unregisters itself +from the ``RefreshTimerVsyncDispatcher``, which holds the last reference +to the ``VsyncParent``, and the object will be deleted. + +Thus the overall flow during normal execution is: + +1. VsyncSource::Display::RefreshTimerVsyncDispatcher receives a Vsync + notification from the OS in the parent process. +2. RefreshTimerVsyncDispatcher notifies + VsyncRefreshTimer::RefreshDriverVsyncObserver that a vsync occurred on + the parent process on the hardware vsync thread. +3.
RefreshTimerVsyncDispatcher notifies the VsyncParent on the hardware + vsync thread that a vsync occurred. +4. The VsyncRefreshTimer::RefreshDriverVsyncObserver in the parent + process posts a task to the main thread that ticks the refresh + drivers. +5. VsyncParent posts a task to the PBackground thread to send a vsync + IPC message to VsyncChild. +6. VsyncChild receives a vsync notification on the content process on the + main thread and ticks its respective RefreshDrivers. + +Compressing Vsync Messages +-------------------------- + +Vsync messages occur quite often and the *main thread* can be busy for +long periods of time due to JavaScript. Consistently sending vsync +messages to the refresh driver timer can flood the *main thread* with +refresh driver ticks, causing even more delays. To avoid this problem, +we compress vsync messages on both the parent and child processes. + +On the parent process, newer vsync messages update a vsync timestamp but +do not actually queue any tasks on the *main thread*. Once the parent +process’ *main thread* executes the refresh driver tick, it uses the +most updated vsync timestamp to tick the refresh driver. After the +refresh driver has ticked, one single vsync message is queued for +another refresh driver tick task. On the content process, the IPDL +``compress`` keyword automatically compresses IPC messages. + +Multiple Monitors +----------------- + +In order to have multiple monitor support for the ``RefreshDrivers``, we +have multiple active ``RefreshTimers``. Each ``RefreshTimer`` is +associated with a specific ``Display`` via an id and ticks when its +respective ``Display`` vsync occurs. We have **N RefreshTimers**, where +N is the number of connected displays. Each ``RefreshTimer`` still has +multiple ``RefreshDrivers``. + +When a tab or window changes monitors, the ``nsIWidget`` receives a +display changed notification.
Based on which display the window is on, +the window switches to the correct ``RefreshTimerVsyncDispatcher`` and +``CompositorVsyncDispatcher`` on the parent process based on the display +id. Each ``TabParent`` should also send a notification to their child. +Each ``TabChild``, given the display ID, switches to the correct +``RefreshTimer`` associated with the display ID. When each display vsync +occurs, it sends one IPC message to notify vsync. The vsync message +contains a display ID, to tick the appropriate ``RefreshTimer`` on the +content process. There is still only one **VsyncParent/VsyncChild** +pair, just each vsync notification will include a display ID, which maps +to the correct ``RefreshTimer``. + +Object Lifetime +--------------- + +1. CompositorVsyncDispatcher - Lives as long as the nsBaseWidget + associated with the VsyncDispatcher +2. CompositorVsyncScheduler::Observer - Lives and dies the same time as + the CompositorBridgeParent. +3. RefreshTimerVsyncDispatcher - As long as the associated display + object, which is the lifetime of Firefox. +4. VsyncSource - Lives as long as the gfxPlatform on the chrome process, + which is the lifetime of Firefox. +5. VsyncParent/VsyncChild - Lives as long as the content process +6. RefreshTimer - Lives as long as the process + +Threads +------- + +All ``VsyncObservers`` are notified on the *Hardware Vsync Thread*. It +is the responsibility of the ``VsyncObservers`` to post tasks to their +respective correct thread. For example, the +``CompositorVsyncScheduler::Observer`` will be notified on the *Hardware +Vsync Thread*, and post a task to the *Compositor Thread* to do the +actual composition. + +1. Compositor Thread - Nothing changes +2. Main Thread - PVsyncChild receives IPC messages on the main thread. + We also enable/disable vsync on the main thread. +3. PBackground Thread - Creates a connection from the PBackground thread + on the parent process to the main thread in the content process. +4. 
Hardware Vsync Thread - Every platform is different, but we always + have the concept of a hardware vsync thread. Sometimes this is + actually created by the host OS. On Windows, we have to create a + separate platform thread that blocks on DwmFlush(). diff --git a/gfx/doc/silkArchitecture.png b/gfx/docs/SilkArchitecture.png similarity index 100% rename from gfx/doc/silkArchitecture.png rename to gfx/docs/SilkArchitecture.png diff --git a/gfx/docs/index.rst b/gfx/docs/index.rst index c6621c81ab24..5b0a918fd1fa 100644 --- a/gfx/docs/index.rst +++ b/gfx/docs/index.rst @@ -1,9 +1,17 @@ -======== Graphics ======== -The graphics team's documentation is currently using doxygen. We're tracking the work to integrate it better at https://bugzilla.mozilla.org/show_bug.cgi?id=1150232. +This collection of linked pages contains design documents for the +Mozilla graphics architecture. The design documents live in gfx/docs directory. -For now you can read the graphics source code documentation here: +This `wiki page `__ contains +information about graphics and the graphics team at Mozilla. -http://people.mozilla.org/~bgirard/doxygen/gfx/ +.. toctree:: + :maxdepth: 1 + + GraphicsOverview + LayersHistory + AsyncPanZoom + AdvancedLayers + Silk