third_party_mesa3d/.gitlab-ci
Tomeu Vizoso d62dd8b0cb gitlab-ci: Switch LAVA jobs to use shared dEQP runner
Take one step towards sharing code between the LAVA and non-LAVA jobs,
with the goals of reducing maintenance burden and use of computational
resources.

The env var DEQP_NO_SAVE_RESULTS allows us to skip the processing of the
XML result files, which can take a long time and is not useful in the
LAVA case as we are not uploading artifacts anywhere at the moment.

Signed-off-by: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
2020-01-06 14:27:36 +01:00
container gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
piglit llvmpipe: enable ARB_shader_group_vote. 2019-12-30 05:30:30 +00:00
arm64.config gitlab-ci: Test Panfrost on T720 GPUs 2019-12-03 04:25:04 +00:00
arm.config gitlab-ci/lava: Test Lima driver with dEQP 2019-10-10 14:50:14 +00:00
build-cts-runner.sh gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
build-deqp-gl.sh gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
build-deqp-vk.sh gitlab-ci: build dEQP VK 1.1.6 in the x86 test image for VK 2019-12-06 10:57:52 +01:00
build-piglit.sh gitlab-ci: bump piglit checkout commit 2019-12-04 15:27:41 +00:00
create-rootfs.sh gitlab-ci: Move LAVA-related files into top-level ci dir 2019-10-06 07:47:41 -07:00
cross-xfail-i386 ci: Run tests on i386 cross builds 2019-09-17 14:53:57 -04:00
deqp-default-skips.txt ci: Make the skip list regexes match the full test name. 2019-11-12 12:54:04 -08:00
deqp-freedreno-a307-fails.txt freedreno/a3xx: Mostly fix min-vs-mag filtering decisions on non-mipmap tex. 2019-09-26 11:27:31 -07:00
deqp-freedreno-a630-fails.txt gitlab-ci/a630: Drop the MSAA expected failure. 2019-09-13 13:50:54 -07:00
deqp-freedreno-a630-skips.txt gitlab-ci/freedreno/a6xx: remove most of the flakes 2019-11-22 13:48:29 -08:00
deqp-lima-fails.txt lima: add support for gl_PointSize 2019-11-05 17:44:56 -08:00
deqp-lima-skips.txt lima: add support for gl_PointSize 2019-11-05 17:44:56 -08:00
deqp-llvmpipe-fails.txt gitlab-ci: Run the GLES2 CTS on llvmpipe. 2019-08-13 10:30:01 -07:00
deqp-panfrost-t720-fails.txt gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
deqp-panfrost-t720-skips.txt gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
deqp-panfrost-t760-fails.txt gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
deqp-panfrost-t760-skips.txt gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
deqp-panfrost-t820-fails.txt gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
deqp-panfrost-t820-skips.txt gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
deqp-panfrost-t860-fails.txt gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
deqp-panfrost-t860-skips.txt gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
deqp-radv-polaris10-skips.txt gitlab-ci: add a job that runs Vulkan CTS with RADV conditionally 2019-12-06 10:58:03 +01:00
deqp-runner.sh gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
deqp-softpipe-fails.txt ci: Enable all of GLES3/3.1 testing for softpipe. 2019-11-12 12:54:04 -08:00
deqp-softpipe-skips.txt ci: Enable all of GLES3/3.1 testing for softpipe. 2019-11-12 12:54:04 -08:00
generate_lava.py gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
lava-deqp.yml.jinja2 gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
lava-gitlab-ci.yml gitlab-ci: Switch LAVA jobs to use shared dEQP runner 2020-01-06 14:27:36 +01:00
meson-build.bat gitlab-ci: Add a job for meson on windows 2019-10-25 22:47:32 +00:00
meson-build.sh gitlab-ci: Move artifact preparation to separate script 2019-11-12 10:14:26 +01:00
prepare-artifacts.sh ci: Remove old commented copy of freedreno artifacts. 2019-11-12 12:54:04 -08:00
README.md freedreno: Introduce gitlab-based CI. 2019-09-12 10:55:42 -07:00
run-shader-db.sh Revert "ci: Switch over to an autoscaling GKE cluster for builds." 2019-11-06 11:38:07 -08:00
scons-build.sh scons: Print a deprecation warning about using scons on not windows 2019-10-24 18:33:50 +00:00
x86_64-w64-mingw32 gitlab-ci: Add a pkg-config for mingw 2019-10-16 23:26:09 +00:00

Mesa testing using gitlab-runner

The goal of the "test" stage of the .gitlab-ci.yml is to do pre-merge testing of Mesa drivers on various platforms, so that we can ensure no regressions are merged, as long as developers are merging code using the "Merge when pipeline succeeds" button.

This document only covers the CI from .gitlab-ci.yml and this directory. For other CI systems, see Intel's Mesa CI or panfrost's LAVA-based CI (src/gallium/drivers/panfrost/ci/).

Software architecture

For freedreno and llvmpipe CI, we're using gitlab-runner on the test devices (DUTs), cached docker containers with VK-GL-CTS, and the normal shared x86_64 runners to build the Mesa drivers to be run inside of those containers on the DUTs.

The docker containers are rebuilt from the debian-install.sh script when DEBIAN_TAG is changed in .gitlab-ci.yml, and debian-test-install.sh when DEBIAN_ARM64_TAG is changed in .gitlab-ci.yml. The resulting images are around 500MB, and are expected to change approximately weekly (though an individual developer working on them may produce many more images while trying to come up with a working MR!).
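
As a rough sketch of how that plays out (illustrative only, not the literal contents of .gitlab-ci.yml; the job name and image path here are made up), the tag variables end up in the container image names, so bumping a tag forces a fresh image build that the later jobs then pull:

  variables:
    DEBIAN_TAG: "2020-01-06"        # bump to force a rebuild of the x86 containers
    DEBIAN_ARM64_TAG: "2020-01-06"  # bump to force a rebuild of the arm64 test containers

  # Hypothetical job consuming one of those images; real image paths differ.
  build-testing:
    image: "$CI_REGISTRY_IMAGE/debian:$DEBIAN_TAG"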

gitlab-runner is a client that polls gitlab.freedesktop.org for available jobs, with no inbound networking requirements. Jobs can have tags, so we can have DUT-specific jobs that only run on runners with that tag marked in the gitlab UI.

Since dEQP takes a long time to run, we set the job's "parallel" field to some count, which spawns that many jobs from one definition, and then deqp-runner.sh takes the corresponding fraction of the test list for each of those jobs.
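
A minimal sketch of such a DUT job follows (the job name, image, runner tag, and script path are hypothetical). "tags" restricts the job to runners carrying that tag as described above, "parallel" spawns the copies, and GitLab exports CI_NODE_INDEX and CI_NODE_TOTAL to each copy so deqp-runner.sh can pick its slice of the test list:

  # Hypothetical job definition; the real job names, images, and paths differ.
  test-freedreno-a630:
    stage: test
    image: "$CI_REGISTRY_IMAGE/arm64_test:$DEBIAN_ARM64_TAG"
    tags:
      - mesa-cheza          # only runners tagged this way in the gitlab UI pick this up
    parallel: 4             # 4 jobs; GitLab sets CI_NODE_INDEX=1..4, CI_NODE_TOTAL=4
    script:
      - ./.gitlab-ci/deqp-runner.sh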

To reduce dEQP runtime (or avoid tests with unreliable results), a deqp-runner.sh invocation can provide a list of tests to skip. If your driver is not yet conformant, you can pass a list of expected failures, and the job will only fail on tests that aren't listed (look at the job's log for which specific tests failed).
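
For instance (the variable names below are hypothetical; check deqp-runner.sh for what it actually reads, though the skip and fails files are the ones checked into this directory), a job might wire those lists up via variables:

  test-freedreno-a630:
    variables:
      # Hypothetical variable names, shown only to illustrate the idea.
      DEQP_SKIPS: deqp-default-skips.txt deqp-freedreno-a630-skips.txt
      DEQP_EXPECTED_FAILS: deqp-freedreno-a630-fails.txt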

DUT requirements

DUTs must have a stable kernel and GPU reset.

If the system goes down during a test run, that job will eventually time out and fail (default 1 hour). However, if the kernel can't reliably reset the GPU on failure, a hang triggered by bugs in one MR can leave the GPU wedged and show up as spurious failures in jobs testing another MR that runs afterwards. This would be an unacceptable impact on Mesa developers working on other drivers.

DUTs must be able to run docker

The Mesa gitlab-runner based test architecture is built around docker, so that we can cache the debian package installation and CTS build step across multiple test runs. Since the images are large and change approximately weekly, the DUTs also need to be running some script to prune stale docker images periodically in order to not run out of disk space as we rev those containers (perhaps this script).

Note that docker doesn't allow containers to be stored on NFS, and doesn't allow multiple docker daemons to interact with the same network block device, so you will probably need some sort of physical storage on your DUTs.

DUTs must be public

By including your device in .gitlab-ci.yml, you're effectively letting anyone on the internet run code on your device. docker containers may provide some limited protection, but how much you trust that and what you do to mitigate hostile access is up to you.

DUTs must expose the dri device nodes to the containers.

Obviously, to get access to the HW, we need to pass the render node through. This is done by adding devices = ["/dev/dri"] to the runners.docker section of /etc/gitlab-runner/config.toml.

HW CI farm expectations

To make sure that testing of one vendor's drivers doesn't block unrelated work by other vendors, we require that a given driver's test farm produces a spurious failure no more than once a week. If every driver had CI and failed once a week, we would be seeing someone's code getting blocked on a spurious failure daily, which is an unacceptable cost to the project.

Additionally, the test farm needs to be able to provide a short enough turnaround time that people can regularly use the "Merge when pipeline succeeds" button successfully (until we get marge-bot in place on freedesktop.org). As a result, we require that the test farm be able to handle a whole pipeline's worth of jobs in less than 5 minutes (for comparison, the build stage takes about 10 minutes, assuming you can get all your jobs scheduled on the shared runners in time).

If a test farm doesn't have enough HW to provide these guarantees, consider dropping tests to reduce runtime. VK-GL-CTS/scripts/log/bottleneck_report.py can help you find which tests were slow in a results.qpa file. Or, you can have a job with no parallel field set and:

  variables:
    CI_NODE_INDEX: 1
    CI_NODE_TOTAL: 10

to just run 1/10th of the test list (these are the same variables GitLab sets automatically for a job with a parallel field, so this mimics being shard 1 of 10).

If a HW CI farm goes offline (network dies and all CI pipelines end up stalled) or its runners are consistently failing spuriously (disk full?), and the maintainer is not immediately available to fix the issue, please push through an MR disabling that farm's jobs by adding '.' to the front of the job names until the maintainer can bring things back up. If this happens, the farm maintainer should provide a report to mesa-dev@lists.freedesktop.org after the fact explaining what happened and what the mitigation plan is for that failure next time.
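
The leading '.' works because GitLab treats a job whose name starts with a dot as a hidden job and never schedules it, so the rest of the pipeline keeps going. A hypothetical example of a disabled job (the job name and script path are made up):

  # Temporarily disabled: drop the leading '.' to re-enable this farm's jobs.
  .test-freedreno-a630:
    stage: test
    parallel: 4
    script:
      - ./.gitlab-ci/deqp-runner.sh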