gecko-dev/taskcluster/docs/taskgraph.rst
Gregory Szorc 0e12f1cc60 Bug 1318200 - Introduce task graph filtering; r=dustin
2016-11-17 15:53:30 -08:00


======================
TaskGraph Mach Command
======================

The task graph is built by linking different kinds of tasks together, pruning
out tasks that are not required, then optimizing by replacing subgraphs with
links to already-completed tasks.

Concepts
--------

* *Task Kind* - Tasks are grouped by kind, where tasks of the same kind do not
  have interdependencies but have substantial similarities, and may depend on
  tasks of other kinds. Kinds are the primary means of supporting diversity,
  in that a developer can add a new kind to do just about anything without
  impacting other kinds.

* *Task Attributes* - Tasks have string attributes which can be used for
  filtering. Attributes are documented in :doc:`attributes`.

* *Task Labels* - Each task has a unique identifier within the graph that is
  stable across runs of the graph generation algorithm. Labels are replaced
  with TaskCluster TaskIds at the latest time possible, facilitating analysis
  of graphs without distracting noise from randomly-generated taskIds.

* *Optimization* - replacement of a task in a graph with an equivalent,
  already-completed task, or a null task, avoiding repetition of work.

Kinds
-----

Kinds are the focal point of this system. They provide an interface between
the large-scale graph-generation process and the small-scale task-definition
needs of different kinds of tasks. Each kind may implement task generation
differently. Some kinds may generate task definitions entirely internally (for
example, symbol-upload tasks are all alike, and very simple), while other kinds
may do little more than parse a directory of YAML files.

A ``kind.yml`` file contains data about the kind, as well as referring to a
Python class implementing the kind in its ``implementation`` key. That
implementation may rely on lots of code shared with other kinds, or contain a
completely unique implementation of some functionality.

The full list of pre-defined keys in this file is:

``implementation``
   Class implementing this kind, in the form ``<module-path>:<object-path>``.
   This class should be a subclass of ``taskgraph.kind.base:Kind``.

``kind-dependencies``
   Kinds which should be loaded before this one. This is useful when the kind
   will use the list of already-created tasks to determine which tasks to
   create, for example adding an upload-symbols task after every build task.

Any other keys are subject to interpretation by the kind implementation.

The result is a nice segmentation of implementation so that the more esoteric
in-tree projects can do their crazy stuff in an isolated kind without making
the bread-and-butter build and test configuration more complicated.
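
As an illustration of the ``<module-path>:<object-path>`` form used by the
``implementation`` key, the following is a minimal sketch of how such a string
could be resolved to a Python object. The helper name ``find_object`` and the
sample configuration are assumptions made for this example, and a standard
library class stands in for a real kind implementation so that the snippet
runs outside the tree.

```python
import importlib


def find_object(path):
    """Resolve a '<module-path>:<object-path>' string to a Python object."""
    module_path, _, object_path = path.partition(":")
    if not module_path or not object_path:
        raise ValueError(
            "expected '<module-path>:<object-path>', got {!r}".format(path)
        )
    obj = importlib.import_module(module_path)
    # The object path may itself be dotted, e.g. "Outer.Inner".
    for name in object_path.split("."):
        obj = getattr(obj, name)
    return obj


# Hypothetical parsed kind.yml contents. A real kind would name a subclass of
# taskgraph.kind.base:Kind; a stdlib class is used here as a stand-in.
kind_config = {
    "implementation": "collections:OrderedDict",
    "kind-dependencies": ["build"],
}

impl_class = find_object(kind_config["implementation"])
print(impl_class.__name__)
```

Any keys other than the two pre-defined ones would simply be handed to the
resolved class to interpret.
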

Dependencies
------------

Dependencies between tasks are represented as labeled edges in the task graph.
For example, a test task must depend on the build task creating the artifact it
tests, and this dependency edge is named 'build'. The task graph generation
process later resolves these dependencies to specific taskIds.

Decision Task
-------------

The decision task is the first task created when a new graph begins. It is
responsible for creating the rest of the task graph.

The decision task for pushes is defined in-tree, in ``.taskcluster.yml``. That
task description invokes ``mach taskgraph decision`` with some metadata about
the push. That mach command determines the optimized task graph, then calls
the TaskCluster API to create the tasks.

Note that this mach command is *not* designed to be invoked directly by humans.
Instead, use the mach commands described below, supplying ``parameters.yml``
from a recent decision task. These commands allow testing everything the
decision task does except the command-line processing and the
``queue.createTask`` calls.

Graph Generation
----------------

Graph generation, as run via ``mach taskgraph decision``, proceeds as follows:

#. For all kinds, generate all tasks. The result is the "full task set".
#. Create dependency links between tasks using kind-specific mechanisms. The
   result is the "full task graph".
#. Filter the target tasks (based on a series of filters, such as try syntax,
   tree-specific specifications, etc). The result is the "target task set".
#. Based on the full task graph, calculate the transitive closure of the target
   task set. That is, the target tasks and all requirements of those tasks.
   The result is the "target task graph".
#. Optimize the target task graph based on kind-specific optimization methods.
   The result is the "optimized task graph" with fewer nodes than the target
   task graph.
#. Create tasks for all tasks in the optimized task graph.
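
The filtering step above can be pictured as a chain of functions, each of
which narrows the set of tasks that survive. The sketch below is only an
illustration under assumed names: ``filter_kinds``, ``filter_platforms``,
``apply_filters``, the ``kinds`` and ``platforms`` parameters, and the task
dictionaries are all hypothetical and do not describe the in-tree API.

```python
def filter_kinds(tasks, parameters):
    """Hypothetical filter: keep tasks whose kind was requested."""
    wanted = parameters["kinds"]
    return {label for label, task in tasks.items() if task["kind"] in wanted}


def filter_platforms(tasks, parameters):
    """Hypothetical filter: keep tasks for the requested platforms."""
    wanted = parameters["platforms"]
    return {
        label
        for label, task in tasks.items()
        if task["attributes"].get("platform") in wanted
    }


def apply_filters(tasks, parameters, filters):
    """Run each filter in turn; a task must pass every one to be retained."""
    remaining = dict(tasks)
    for task_filter in filters:
        keep = task_filter(remaining, parameters)
        remaining = {l: t for l, t in remaining.items() if l in keep}
    return set(remaining)


# Made-up full task set, keyed by label.
full_task_set = {
    "build-linux64": {"kind": "build", "attributes": {"platform": "linux64"}},
    "build-win32": {"kind": "build", "attributes": {"platform": "win32"}},
    "test-linux64": {"kind": "test", "attributes": {"platform": "linux64"}},
    "test-win32": {"kind": "test", "attributes": {"platform": "win32"}},
}
parameters = {"kinds": {"test"}, "platforms": {"linux64"}}

target_task_set = apply_filters(
    full_task_set, parameters, [filter_kinds, filter_platforms]
)
print(sorted(target_task_set))
```

Note that the build task is not in the resulting target task set even though
the test depends on it; pulling in such requirements is the job of the
transitive closure step.
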

Transitive Closure
..................

Transitive closure is a fancy name for this sort of operation:

* start with a set of tasks
* add all tasks on which any of those tasks depend
* repeat until nothing changes

The effect is this: imagine you start with a linux32 test job and a linux64
test job. In the first round, each test task depends on the test docker image
task, so add that image task. Each test also depends on a build, so add the
linux32 and linux64 build tasks.

Then repeat: the test docker image task is already present, as are the build
tasks, but those build tasks depend on the build docker image task. So add
that build docker image task. Repeat again: this time, none of the tasks in
the set depend on a task not in the set, so nothing changes and the process is
complete.

And as you can see, the graph we've built now includes everything we wanted
(the test jobs) plus everything required to do that (docker images, builds).
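
The walk-through above can be written as a short fixed-point loop. This is a
minimal sketch, not the in-tree implementation: the function name, the task
labels, and the representation of dependencies as a mapping from label to
``{edge name: label}`` are assumptions chosen to mirror the example.

```python
def transitive_closure(targets, dependencies):
    """Return the targets plus everything they directly or indirectly need.

    `dependencies` maps each label to a dict of {edge name: required label}.
    """
    closure = set(targets)
    while True:
        required = {
            dep
            for label in closure
            for dep in dependencies.get(label, {}).values()
        }
        new = required - closure
        if not new:
            # Nothing in the set depends on a task outside the set.
            return closure
        closure |= new


# Hypothetical labels mirroring the linux32/linux64 example.
dependencies = {
    "test-linux32": {"build": "build-linux32", "docker-image": "image-test"},
    "test-linux64": {"build": "build-linux64", "docker-image": "image-test"},
    "build-linux32": {"docker-image": "image-build"},
    "build-linux64": {"docker-image": "image-build"},
    "build-macosx": {"docker-image": "image-build"},  # never requested
    "image-test": {},
    "image-build": {},
}

target_graph = transitive_closure({"test-linux32", "test-linux64"}, dependencies)
print(sorted(target_graph))
```

The first pass adds the two builds and the test image, the second adds the
build image, and the third finds nothing new, so the loop stops. The
unrequested ``build-macosx`` task never enters the set.
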

Optimization
------------

The objective of optimization is to remove as many tasks from the graph as
possible, as efficiently as possible, thereby delivering useful results as
quickly as possible. For example, ideally if only a test script is modified in
a push, then the resulting graph contains only the corresponding test suite
task.

A task is said to be "optimized" when it is either replaced with an equivalent,
already-existing task, or dropped from the graph entirely.

A task can be optimized if all of its dependencies can be optimized and none of
its inputs have changed. For a task on which no other tasks depend (a "leaf
task"), the optimizer can determine what has changed by looking at the
version-control history of the push: if the relevant files are not modified in
the push, then it considers the inputs unchanged. For tasks on which other
tasks depend ("non-leaf tasks"), the optimizer must replace the task with
another, equivalent task, so it generates a hash of all of the inputs and uses
that to search for a matching, existing task.

In some cases, such as try pushes, tasks in the target task set have been
explicitly requested and are thus excluded from optimization. In other cases,
the target task set is almost the entire task graph, so targeted tasks are
considered for optimization. This behavior is controlled with the
``optimize_target_tasks`` parameter.
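
To make the two checks described above concrete, here is a rough sketch under
stated assumptions. The function names, the choice of SHA-256 over a canonical
JSON encoding, the in-memory ``index`` dictionary, and the sample inputs are
all hypothetical; the real optimizer's hashing scheme and lookup mechanism are
kind-specific and are not described here.

```python
import hashlib
import json


def input_digest(inputs):
    """Hash JSON-serializable inputs so that equal inputs give equal digests."""
    canonical = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def leaf_inputs_unchanged(relevant_files, files_changed):
    """Leaf task check: did the push touch none of the relevant files?"""
    return set(relevant_files).isdisjoint(files_changed)


def find_replacement(inputs, index):
    """Non-leaf task check: look up an existing task by its input digest."""
    return index.get(input_digest(inputs))


# Hypothetical inputs for a docker image task and a stand-in index that maps
# digests to the taskIds of already-completed tasks.
image_inputs = {"dockerfile": "FROM base", "scripts": ["setup.sh"]}
index = {input_digest(image_inputs): "existing-task-id"}

same_inputs = {"scripts": ["setup.sh"], "dockerfile": "FROM base"}
changed_inputs = {"dockerfile": "FROM base:2", "scripts": ["setup.sh"]}

print(find_replacement(same_inputs, index))
print(find_replacement(changed_inputs, index))
```

Sorting keys before hashing keeps the digest independent of dictionary
ordering, so only a genuine change to an input causes a lookup miss.
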

Action Tasks
------------

Action Tasks are tasks which help you to schedule new jobs via Treeherder's
"Add New Jobs" feature. The Decision Task creates a YAML file named
``action.yml`` which can be used to schedule Action Tasks after suitably replacing
``{{decision_task_id}}`` and ``{{task_labels}}``, which correspond to the decision
task ID of the push and a comma-separated list of task labels which need to be
scheduled.

This task invokes ``mach taskgraph action-task`` which builds up a task graph of
the requested tasks. This graph is optimized against the tasks that the
decision task already scheduled for the same push.

So for instance, if you had already requested a build task in the ``try`` command,
and you wish to add a test which depends on this build, the original build task
is re-used.

Action Tasks are currently scheduled by
`pulse_actions <https://github.com/mozilla/pulse_actions>`_. This feature is only
present on ``try`` pushes for now.
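
The placeholder replacement described at the start of this section amounts to
plain string substitution. The sketch below assumes a helper named
``render_action_template`` and uses a made-up two-line template; the real
``action.yml`` is a full task definition whose layout is not reproduced here.

```python
def render_action_template(template, decision_task_id, task_labels):
    """Fill in the two placeholders used by action.yml."""
    return template.replace("{{decision_task_id}}", decision_task_id).replace(
        "{{task_labels}}", ",".join(task_labels)
    )


# Stand-in template text and made-up values, for illustration only.
template = "decision: {{decision_task_id}}\nlabels: {{task_labels}}\n"

rendered = render_action_template(
    template,
    decision_task_id="example-decision-id",
    task_labels=["test-linux64-a", "test-linux64-b"],
)
print(rendered)
```
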

Mach commands
-------------

A number of mach subcommands are available aside from ``mach taskgraph
decision`` to make this complex system more accessible to those trying to
understand or modify it. They allow you to run portions of the
graph-generation process and output the results.

``mach taskgraph tasks``
   Get the full task set

``mach taskgraph full``
   Get the full task graph

``mach taskgraph target``
   Get the target task set

``mach taskgraph target-graph``
   Get the target task graph

``mach taskgraph optimized``
   Get the optimized task graph

Each of these commands takes a ``--parameters`` option giving a file with
parameters to guide the graph generation. The decision task helpfully produces
such a file on every run, and that is generally the easiest way to get a
parameter file. The parameter keys and values are described in
:doc:`parameters`; using that information, you may modify an existing
``parameters.yml`` or create your own.

Task Parameterization
---------------------

A few components of tasks are only known at the very end of the decision task
-- just before the ``queue.createTask`` call is made. These are specified
using simple parameterized values, as follows:

``{"relative-datestamp": "certain number of seconds/hours/days/years"}``
   Objects of this form will be replaced with an offset from the current time
   just before the ``queue.createTask`` call is made. For example, an
   artifact expiration might be specified as ``{"relative-datestamp": "1
   year"}``.

``{"task-reference": "string containing <dep-name>"}``
   The task definition may contain "task references" of this form. These will
   be replaced during the optimization step, with the appropriate taskId for
   the named dependency substituted for ``<dep-name>`` in the string.
   Multiple labels may be substituted in a single string, and ``<<>`` can be
   used to escape a literal ``<``.
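
The two substitutions can be sketched as a recursive walk over a task
definition. This is an illustration only: the function names, the 365-day
year, the ISO 8601 output format, and the sample task are assumptions, and the
in-tree code may differ in all of them.

```python
import datetime
import re

# Assumed unit table; a year is approximated as 365 days in this sketch.
_UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "year": 365 * 86400,
}
_REFERENCE = re.compile(r"<([^>]+)>")


def resolve_datestamp(spec, now):
    """Turn e.g. '1 year' into a timestamp offset from `now`."""
    count, unit = spec.split()
    unit = unit[:-1] if unit.endswith("s") else unit
    delta = datetime.timedelta(seconds=int(count) * _UNIT_SECONDS[unit])
    return (now + delta).isoformat()


def resolve_reference(text, task_ids):
    """Replace each <dep-name> with its taskId; '<<>' yields a literal '<'."""

    def substitute(match):
        name = match.group(1)
        return "<" if name == "<" else task_ids[name]

    return _REFERENCE.sub(substitute, text)


def resolve(value, now, task_ids):
    """Recursively resolve parameterized values inside a task definition."""
    if isinstance(value, dict):
        if set(value) == {"relative-datestamp"}:
            return resolve_datestamp(value["relative-datestamp"], now)
        if set(value) == {"task-reference"}:
            return resolve_reference(value["task-reference"], task_ids)
        return {k: resolve(v, now, task_ids) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, now, task_ids) for v in value]
    return value


# Hypothetical task fragment, dependency names, and taskIds.
task = {
    "expires": {"relative-datestamp": "1 year"},
    "command": [{"task-reference": "fetch <build> but keep <<>tag>"}],
}
now = datetime.datetime(2016, 11, 17)
resolved = resolve(task, now, {"build": "abc123"})
print(resolved)
```
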

Taskgraph JSON Format
---------------------

Task graphs -- both the graph artifacts produced by the decision task and those
output by the ``--json`` option to the ``mach taskgraph`` commands -- are JSON
objects, keyed by label, or for optimized task graphs, by taskId. For
convenience, the decision task also writes out ``label-to-taskid.json``
containing a mapping from label to taskId. Each task in the graph is
represented as a JSON object.

Each task has the following properties:

``task_id``
   The task's taskId (only for optimized task graphs)

``label``
   The task's label

``attributes``
   The task's attributes

``dependencies``
   The task's in-graph dependencies, represented as an object mapping
   dependency name to label (or to taskId for optimized task graphs)

``task``
   The task's TaskCluster task definition.

``kind_implementation``
   The module and the class name which was used to implement this particular task.
   It is always of the form ``<module-path>:<object-path>``

The results from each command are in the same format, but with some differences
in the content:

* The ``tasks`` and ``target`` subcommands both return graphs with no edges.
  That is, just collections of tasks without any dependencies indicated.

* The ``optimized`` subcommand returns tasks that have been assigned taskIds.
  The ``dependencies`` object, too, contains taskIds instead of labels, with
  dependencies on optimized tasks omitted. However, the ``task.dependencies``
  array is populated with the full list of dependency taskIds. All task
  references are resolved in the optimized graph.

The output of the ``mach taskgraph`` commands is suitable for processing with
the `jq <https://stedolan.github.io/jq/>`_ utility. For example, to extract all
tasks' labels and their dependencies:

.. code-block:: shell

    jq 'to_entries | map({label: .value.label, dependencies: .value.dependencies})'
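
The same extraction can be done in Python where ``jq`` is not available. This
is a small sketch that relies only on the properties listed above; the
function name and the tiny sample graph are made up for the example.

```python
import json


def labels_and_dependencies(graph_json):
    """Return each task's label and dependencies from a task graph JSON string."""
    graph = json.loads(graph_json)
    return [
        {"label": task["label"], "dependencies": task["dependencies"]}
        for task in graph.values()
    ]


# Made-up two-task graph in the documented format, keyed by label.
sample = json.dumps(
    {
        "build-linux64": {
            "label": "build-linux64",
            "attributes": {},
            "dependencies": {},
            "task": {},
        },
        "test-linux64": {
            "label": "test-linux64",
            "attributes": {},
            "dependencies": {"build": "build-linux64"},
            "task": {},
        },
    }
)

summary = labels_and_dependencies(sample)
for entry in summary:
    print(entry["label"], entry["dependencies"])
```
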