Bug 1879244 - Add documentation for greening up tests with --new-test-config. r=aryx

Differential Revision: https://phabricator.services.mozilla.com/D201041
Joel Maher 2024-02-09 14:56:47 +00:00
parent 86107ae8e9
commit 41e1354d68
2 changed files with 293 additions and 161 deletions


@@ -7,173 +7,96 @@
...get to this stage, you will have seen a try push with all the tests running
(many not green) to verify some tests pass and there are enough machines
available to run tests.
For the purpose of this document, assume you are tasked with upgrading Windows
10 OS from version 1803 -> 1903. To simplify this we can call this `windows_1903`,
and we need to:
* create meta bug
* push to try
* run skip-fails
* repeat 2 more times
* land changes and turn on tests
* turn on run only failures
* file bugs for test failures
There are many edge cases, and I will outline them inside each step.
If you are running this manually or on configs/tests that are not supported with
`./mach try --new-test-config`, then please follow the steps `here <manual.html>`__.
Create Meta Bug
---------------
This is a simple step where you create a meta bug to track the failures associated
with the tests you are greening up. If this is a test suite (e.g. `devtools`), it
is ok to have a meta bug just for the test suite and the new platform.
All bugs related to tests skipped or failing will be blocking this meta bug.
Push to Try Server
------------------
Now that you have a configuration setup and machines available via try server, it
is time to run try. If you are migrating mochitest or xpcshell, then you can do:

``./mach try fuzzy --no-artifact --full --rebuild 10 --new-test-config -q 'test-windows10-64-1903 mochitest-browser-chrome !ccov !ship !browsertime !talos !asan'``

This will run many tests (thanks to ``--full`` and ``--rebuild 10``), but will
give plenty of useful data.

In the scenario you are migrating tests such as:

* performance
* web-platform-tests
* reftest / crashtest / jsreftest
* mochitest-webgl (has a different process for test skipping)
* cppunittest / gtest / junit
* marionette / firefox-ui / telemetry

then please follow the steps `here <manual.html>`__.

If you are migrating to a small machine pool, it is best to avoid ``--rebuild 10``
and instead do ``--rebuild 3``. Likewise, please limit your jobs to the specific
test suite and variant. The size of a worker pool is shown on the Workers page of
the Taskcluster instance.
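For example, a scoped push against a small pool might look like the following
(an illustrative sketch; substitute your own suite and platform filter):

``./mach try fuzzy --no-artifact --rebuild 3 --new-test-config -q 'test-windows10-64-1903 xpcshell !ccov !ship !asan'``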
Run skip-fails
--------------
When the try push is completed, it is time to run skip-fails. Skip-fails will
look at all the test results and automatically create a set of local changes
with skip-if conditions to green up the tests faster.
``./mach manifest skip-fails --b bugzilla.mozilla.org -m <meta_bug_id> --turbo "https://treeherder.mozilla.org/jobs?repo=try&revision=<rev>"``
Please input the proper `meta_bug_id` and `rev` into the above command.
The first time running this, you will need to get a `bugzilla api key
<https://bugzilla.mozilla.org/userprefs.cgi?tab=apikey>`__. Copy this key and
add it to your `~/.config/python-bugzilla/bugzillarc` file:
.. code-block:: none

    $ cat ~/.config/python-bugzilla/bugzillarc
    [DEFAULT]
    url = https://bugzilla.mozilla.org
    [bugzilla.mozilla.org]
    api_key = <key>
When the command finishes, you will have new bugs created that are blocking the
meta bug. In addition, you will have many changes to manifests adding skip-if
conditions for tests that fail 40% of the time, or for entire manifests that
take >20 minutes to run on opt or >40 minutes on debug.
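As an illustration, the kind of annotation skip-fails adds to a manifestparser
manifest looks roughly like this (a sketch only; the exact condition string and
bug reference come from your own run, and the syntax differs for reftest/wpt
manifests):

.. code-block:: toml

    # Added by skip-fails; the bug id below is a placeholder.
    ["browser_restore_isAppTab.js"]
    skip-if = ["windows_1903"]  # Bug <bug_id> - crashes on windows_1903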
You will need to create a commit (or `--amend` your previous commit if this is round 2 or 3):
``hg commit -m "Bug <meta_bug_id> - Green up tests for <suite> on <platform>"``
Repeat 2 More Times
-------------------
A try push will take many hours; it is best to push in the afternoon, ensure
some jobs are running, then come back the next day.
In 3 rounds this should be complete and ready to submit for review and turn on
the new tests.
There will be additional failures; those will follow the normal process for
intermittents.
Land Changes and Turn on Tests
------------------------------
@@ -205,18 +128,3 @@
...mochitest-gpu, browser-chrome, devtools, web-platform-tests, crashtest, etc.),
there will need to be a corresponding tier-3 job that is created.
TODO: point to examples of how to add this after we get our first jobs running.
File Bugs for Test Failures
---------------------------
Once the failure jobs are running on mozilla-central, we have full coverage
and the ability to run tests on try server. There could be >100 tests that are
marked as ``skip-if``, and it would take a lot of time to file a bug per test.
Instead we will file a bug for each manifest that is edited; typically this
reduces the number of bugs to about 40% of the total tests (averaging out to
2.5 test failures/manifest).
When filing the bug, indicate the timeline, how to run the failure, link to the
bug where we created the config, briefly describe the config change (e.g.
upgrade Windows 10 from version 1803 to 1903), and finally needinfo the triage
owner indicating this is a heads up and that these tests are running regularly
on mozilla-central for the next 6-7 weeks.


@@ -0,0 +1,224 @@
:orphan:
Turning on Firefox tests for a new configuration (manual)
=========================================================
You are ready to go with turning on Firefox tests for a new config. Once you
get to this stage, you will have seen a try push with all the tests running
(many not green) to verify some tests pass and there are enough machines
available to run tests.
For the purpose of this document, assume you are tasked with upgrading Windows
10 OS from 1803 -> 1903. To simplify this we can call this `windows_1903`, and
we need to:
* push to try
* analyze test failures
* disable tests in manifests
* repeat try push until no failures
* file bugs for test failures
* land changes and turn on tests
* turn on run only failures
There are many edge cases, and I will outline them inside each step.
Push to Try Server
------------------
As you have new machines (or cloud instances) available with the updated
OS/config, it is time to push to try.
In order to run all tests, we would need to execute:
``./mach try fuzzy --no-artifact -q 'test-windows !-raptor- !-talos-' --rebuild 10``
There are a few exceptions here:
* Perf tests don't need to be run (hence the ``!-raptor- !-talos-``)
* Need to make sure we are not building with artifact builds (hence the
``--no-artifact``)
* There are jobs hidden behind tier-3, some for a good reason (code coverage is
a good example, but fission tests might not be green)
The last piece to sort out is running on the new config, here are some
considerations for new configs:
* duplicated jobs (e.g. fission, a11y-checks), you can just run those specific
  tasks: ``./mach try fuzzy --no-artifact -q 'test-windows fission' --rebuild 5``
* new OS/hardware (e.g. aarch64, OS upgrade), you need to reference the new
  hardware, typically with ``--worker-override``: ``./mach try fuzzy
  --no-artifact -q 'test-windows' --rebuild 10 --worker-override
  t-win10-64=gecko-t/t-win10-64-1903``
  * the risk here is a scenario where hardware is limited; ``--rebuild 10``
    will create too many tasks and some will expire.
  * in low hardware situations, either run a subset of tests (e.g.
    web-platform-tests, mochitest), or use ``--rebuild 3`` and repeat.
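For instance, a subset push with fewer rebuilds on the new worker pool might
look like this (an illustrative sketch; substitute your own suite filter and
worker type):

``./mach try fuzzy --no-artifact -q 'test-windows mochitest' --rebuild 3 --worker-override t-win10-64=gecko-t/t-win10-64-1903``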
Analyze Test Failures
---------------------
A try push will take many hours; it is best to push when you start work so that
results are ready later in your day, or push at the end of your day so that
results are ready when you come back the next day. Please make sure some tasks
start before walking away; otherwise a small typo can delay this process by
hours or a full day.
The best way to look at test failures is to use Push Health to avoid misleading
data. Push Health will bucket failures into possible regressions, known
regressions, etc. When looking at 10 data points (from ``--rebuild 10``), this
will filter out intermittent failures.
There are many reasons you might have invalid or misleading data:
#. Tests fail intermittently; we need a pattern to know if a failure is
   consistent or intermittent.
#. We still want to disable high frequency intermittent tests, those are just
   annoying.
#. You could be pushing off a bad base revision (a regression or intermittent
   that comes from the base revision).
#. The machines you run on could be bad, skewing the data.
#. Infrastructure problems could cause jobs to fail at random places; repeated
   jobs filter that out.
#. Some failures could affect future tests in the same browser session or task.
#. If a crash occurs or we time out, it is possible that we will not run all of
   the tests in the task, therefore believing a test was run 10 times when maybe
   it was only run once (and failed), or never run at all.
#. Task failures that do not have a test name (leak on shutdown, crash on
   shutdown, timeout on shutdown, etc.)
That is a long list of reasons not to trust the data. Luckily, most of the time
``--rebuild 10`` will give us enough data to be confident that we have found all
the failures and can ignore random/intermittent failures.
Knowing the reasons for misleading data, here is a way to use `Push Health
<https://treeherder.mozilla.org/push-health/push?revision=abaff26f8e084ac719bea0438dba741ace3cf5d8&repo=try&testGroup=pr>`__.
* Alternatively, you could use the `API
<https://treeherder.mozilla.org/api/project/try/push/health/?revision=abaff26f8e084ac719bea0438dba741ace3cf5d8>`__
to get raw data and work towards building a tool
* If you write a tool, you will need to parse the resulting JSON, build a list
  of failures, and match it against the list of job names to find how many times
  each job ran and failed/passed (see the sketch below).
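Here is a minimal sketch of such a tool in Python, assuming the Push Health
payload keeps its failure details under ``metrics -> tests -> details`` (these
key names are an assumption based on current payloads; inspect the real JSON
before relying on them):

.. code-block:: python

    # Sketch: tally failing tests from a try push via the Push Health API.
    # Key names (metrics/tests/details, needInvestigation, testName) are
    # assumptions; verify them against the actual JSON response.
    import json
    from collections import Counter
    from urllib.request import urlopen

    revision = "abaff26f8e084ac719bea0438dba741ace3cf5d8"  # your try revision
    url = ("https://treeherder.mozilla.org/api/project/try/push/health/"
           f"?revision={revision}")

    with urlopen(url) as resp:
        health = json.load(resp)

    details = health.get("metrics", {}).get("tests", {}).get("details", {})
    counts = Counter()
    for failure in details.get("needInvestigation", []):
        counts[failure.get("testName", "<unknown>")] += 1

    # Most frequently failing tests first; candidates for skip-if.
    for test, count in counts.most_common():
        print(f"{count}x {test}")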
The main goal here is to know which <path>/<filename> entries are failing, and
to have a list of those. Ideally you would record some additional information
like timeout, crash, failure, etc. In the end you might end up with::

    dom/html/test/test_fullscreen-api.html, scrollbar
    gfx/layers/apz/test/mochitest/test_group_hittest.html, scrollbar
    image/test/mochitest/test_animSVGImage.html, timeout
    browser/base/content/test/general/browser_restore_isAppTab.js, crashed
Disable Tests in the Manifest Files
-----------------------------------
The code sheriffs have been using `this documentation
<https://wiki.mozilla.org/Auto-tools/Projects/Stockwell/disable-recommended>`__
for training and reference when they disable intermittents.
First you need to add a keyword to be available in the manifest (e.g. ``skip-if
= windows_1903``).
There are many exceptions, the bulk of the work will fall into one of 4
categories:
# `manifestparser <mochitest_xpcshell_manifest_keywords>`_: \*.toml (mochitest*,
firefox-ui, marionette, xpcshell) easy to edit by adding a ``skip-if =
windows_1903 # <comment>``, a few exceptions here
# `reftest <reftest_manifest_keywords>`_: \*.list (reftest, crashtest) need to
add a ``fuzzy-if(windows_1903, A, B)``, this is more specific
# web-platform-test: testing/web-platform/meta/\*\*.ini (wpt, wpt-reftest,
etc.) need to edit/add testing/web-platform/meta/<path>/<testname>.ini, and add
expected results
# Other (compiled tests, jsreftest, etc.) edit source code, ask for help.
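To make these concrete, here is roughly what each kind of edit looks like,
reusing the failing tests from the earlier list (the fuzz values, bug id, and
wpt condition are illustrative placeholders):

.. code-block:: none

    # manifestparser (*.toml): skip the test on the new config
    ["test_fullscreen-api.html"]
    skip-if = ["windows_1903"]  # Bug <bug_id> - scrollbar issue

    # reftest (*.list): accept small pixel differences on the new config
    fuzzy-if(windows_1903,0-2,0-40) == test_group_hittest.html test_group_hittest-ref.html

    # web-platform-tests (meta/**.ini): record the expected result
    [test_animSVGImage.html]
      expected:
        if os == "win" and os_version == "1903": TIMEOUT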
Basically, we want to take every non-intermittent failure found from push health
and edit the corresponding manifest. This typically means:
* Finding the proper manifest.
* Adding the right text to the manifest.
To find the proper manifest, it is typically <path>/<harness>.[toml|list].
There are exceptions, and if in doubt use https://searchfox.org/ to find the
manifest which contains the testname.
Once you have the manifest, open it in an editor, and search for the exact test
name (there could be similarly named tests).
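For example, to find which manifest references a test from your failure list
(an illustrative grep; searchfox works just as well):

``grep -rl "test_animSVGImage.html" --include="*.toml" --include="*.list" image/ testing/``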
Rerun Try Push, Repeat as Necessary
-----------------------------------
It is important to test your changes and, for a new platform that will be
sheriffed, to rerun all the tests at scale.
With your change in a commit, push again to try with ``--rebuild 10`` and come
back the next day.
As there are so many edge cases, it is quite likely that you will have more
failures; mentally plan on 3 iterations of this, where each iteration has fewer
failures.
Once you get a full push showing no persistent failures, it is time to land
those changes and turn on the new tests. There is a large risk here: the longer
you take to find all failures, the greater the chance of:
* Bitrot of your patch
* New tests being added which could fail on your config
* Other edits to tests/tools which could affect your new config
Since the new config process is designed to find failures fast and get the
changes landed fast, we do not need to ask developers for review; that comes
after the new config is running successfully, when we notify the teams of which
tests are failing.
File Bugs for Test Failures
---------------------------
Once the failure jobs are running on mozilla-central, we have full coverage
and the ability to run tests on try server. There could be >100 tests that are
marked as ``skip-if``, and it would take a lot of time to file a bug per test.
Instead we will file a bug for each manifest that is edited; typically this
reduces the number of bugs to about 40% of the total tests (averaging out to
2.5 test failures/manifest).
When filing the bug, indicate the timeline, how to run the failure, link to the
bug where we created the config, briefly describe the config change (e.g.
upgrade Windows 10 from version 1803 to 1903), and finally needinfo the triage
owner indicating this is a heads up and that these tests are running regularly
on mozilla-central for the next 6-7 weeks.
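A sketch of such a bug description (every specific below is a placeholder to be
filled in from your own push and config bug):

.. code-block:: none

    Summary: Re-enable tests skipped on windows_1903 in <path>/<manifest>

    These tests were skipped while greening up windows_1903 (config bug: <bug_id>).
    To reproduce: ./mach try fuzzy --no-artifact -q 'test-windows10-64-1903 <suite>'
    This config runs regularly on mozilla-central; please take a look within
    the next 6-7 weeks.  (needinfo the triage owner as a heads up)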
Land Changes and Turn on Tests
------------------------------
After you have a green test run, it is time to land the patches. There could
be changes needed to the taskgraph in order to add the new hardware type and
duplicate tests to run on both the old and the new config, or to create a new
variant and denote which tests to run on that variant.
Using our example of ``windows_1903``, this would be a new worker type that
would require these edits:
* `transforms/tests.py <https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/tests.py#97>`__ (duplicate windows 10 entries)
* `test-platforms.yml <https://searchfox.org/mozilla-central/source/taskcluster/ci/test/test-platforms.yml#229>`__ (copy windows10 debug/opt/shippable/asan entries and make win10_1903)
* `test-sets.yml <https://searchfox.org/mozilla-central/source/taskcluster/ci/test/test-sets.yml#293>`__ (ideally you need nothing, otherwise copy ``windows-tests`` and edit the test list)
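As a rough illustration, the new test-platforms.yml entry might look like this
(a sketch based on copying an existing windows10 block; the field values are
illustrative, not exact):

.. code-block:: yaml

    # Hypothetical entry for the upgraded OS, cloned from windows10-64/opt.
    windows10-64-1903/opt:
        build-platform: windows2012-64/opt
        test-sets:
            - windows-tests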
In general this should allow you to have tests scheduled with no custom flags
in try server and all of these will be scheduled by default on
``mozilla-central``, ``autoland``, and ``release-branches``.
Turn on Run Only Failures
-------------------------
Now that we have tests running regularly, the next step is to take all the
disabled tests and run them in the special failures job.
We have a basic framework created, but for every test harness (e.g. xpcshell,
mochitest-gpu, browser-chrome, devtools, web-platform-tests, crashtest, etc.),
there will need to be a corresponding tier-3 job created.
TODO: point to examples of how to add this after we get our first jobs running.