Bug 1391476 - Add UID and GID to cache parameters; r=dustin

The UID and GID that a task executes under are dynamic. As a result,
caches need to be aware of the UID and GID that own files; otherwise,
subsequent tasks could run into permission-denied errors. This is
why `run-task --chown-recursive` exists. By recursively changing
ownership of persisted files, we ensure the current task is able
to read and write all existing files.
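The recursive chown that flag performs can be sketched roughly as follows (a minimal illustration, not the exact run-task helper, which also takes user/group names; numeric IDs only here):

```python
import os

def chown_recursive(path, uid, gid):
    # Change ownership of the root, then of every directory and file
    # beneath it. Each entry costs one chown(2) system call, which is
    # why this gets expensive on caches with many files.
    os.chown(path, uid, gid)
    for root, dirs, files in os.walk(path):
        for entry in dirs + files:
            os.chown(os.path.join(root, entry), uid, gid)
```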

When you take a step back, you realize that chowning cached
files is an expensive workaround. Yes, it results in cache hits.
But the cost is that you potentially have to perform hundreds of
thousands of chown() system calls to mass chown a populated cache.
The ideal situation is that the UID/GID is consistent across all
tasks using any given cache, so the potentially expensive
permission fixing can be avoided entirely. That's what this
commit does.

We add the task's UID and GID to run-task's cache requirements.
The first time we see a cache, we record the UID and GID with it
and chown the empty cache directory to that UID and GID. Subsequent
tasks using this cache *must* run with the same UID and GID or else
run-task will fail.
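That handshake can be sketched as follows (illustrative helper and names; the real logic lives in run-task's main() and also chowns the directory, which requires root and is omitted here):

```python
import os

def check_cache(cache, uid, gid):
    # Sketch of the .cacherequires handshake: the first consumer of an
    # empty cache records its requirements; later consumers must match.
    requires_path = os.path.join(cache, '.cacherequires')
    our_requirements = {b'version=1', b'uid=%d' % uid, b'gid=%d' % gid}

    if not os.listdir(cache):
        # First consumer: record requirements for future invocations.
        # run-task would also chown the empty directory here.
        with open(requires_path, 'wb') as fh:
            fh.write(b'\n'.join(sorted(our_requirements)))
        return True

    with open(requires_path, 'rb') as fh:
        wanted = set(fh.read().splitlines())
    # Any requirement we can't satisfy (e.g. a UID/GID mismatch) is fatal.
    return not (wanted - our_requirements)
```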

Since run-task now guarantees that all cache consumers use the same
UID and GID, we can avoid a potentially expensive recursive chown.

But there is an exception. In untrusted environments (namely Try),
we recursively chown existing caches if there is a UID/GID mismatch.
We do this because Try is a sandbox and any random task could
experiment with a non-standard UID/GID. Such a task would "poison"
the cache for the next consumer, or vice versa. It would be
annoying if caches were randomly poisoned by Try pushes whose
authors didn't realize there was a UID/GID mismatch. We could
outlaw "bad" UIDs and GIDs, but that would raise the barrier to
testing things on Try. So we go with the flow and recursively
chown caches in this scenario.
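The resulting decision table for a populated cache might be sketched like this (illustrative function and return values; the actual run-task code also rewrites the requirements file after re-chowning):

```python
def handle_mismatch(missing, untrusted, running_as_root):
    # missing: requirements recorded with the cache that this task
    # does not satisfy. Tolerate uid/gid-only mismatches on untrusted
    # (Try) caches by re-chowning; anything else is fatal.
    uid_gid_only = all(s.startswith((b'uid=', b'gid=')) for s in missing)
    if missing and untrusted and running_as_root and uid_gid_only:
        return 'chown'  # recursively chown, then record new requirements
    if missing:
        return 'fail'   # abort the task with an explanatory error
    return 'ok'         # requirements match; use the cache as-is
```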

This change will shine a light on all tasks using inconsistent UID
and GID values on the same cache. Bustage is anticipated.
Unfortunately, we can't easily know in advance what will break, so
we will have to fix problems as they arise. Fortunately, because
cache names are now tied to the content of run-task, we only need
to back out this change: tasks will revert to caches without the
UID/GID pinning requirement and everything will work again.

MozReview-Commit-ID: 2ka4rOnnXIp

--HG--
extra : rebase_source : ccb2b0a9230694f989775b26d5276fd3ac928af3
extra : source : 083d2e1cc8fe44b04e44f74bda3dd8bc75ba826c
This commit is contained in:
Gregory Szorc 2017-08-22 16:49:26 -07:00
parent 4e063535e6
commit 5fb99cdfc3
2 changed files with 58 additions and 4 deletions


@ -37,6 +37,22 @@ FALLBACK_FINGERPRINT = {
":bf:d7:b8:40:84:01:48:9c:26:ce:d9"}
CACHE_UID_GID_MISMATCH = '''
There is a UID/GID mismatch on the cache. This likely means:
a) different tasks are running as different users/groups
b) different Docker images have different UID/GID for the same user/group
Our cache policy is that the UID/GID for ALL tasks must be consistent
for the lifetime of the cache. This eliminates permissions problems due
to file/directory user/group ownership.
To make this error go away, ensure that all Docker images use
a consistent UID/GID and that all tasks using this cache run as
the same user/group.
'''
NON_EMPTY_VOLUME = '''
error: volume %s is not empty
@ -281,6 +297,12 @@ def main(args):
else:
caches = []
if 'TASKCLUSTER_UNTRUSTED_CACHES' in os.environ:
untrusted_caches = True
del os.environ['TASKCLUSTER_UNTRUSTED_CACHES']
else:
untrusted_caches = False
our_requirements = {
# Include a version string that we can bump whenever to trigger
# fresh caches. The actual value is not relevant and doesn't need
@ -288,6 +310,10 @@ def main(args):
# hash into cache names, any change to this file/version is sufficient
# to force the use of a new cache.
b'version=1',
# Include the UID and GID the task will run as to ensure that tasks
# with different UID and GID don't share the same cache.
b'uid=%d' % uid,
b'gid=%d' % gid,
}
for cache in caches:
@ -297,12 +323,13 @@ def main(args):
requires_path = os.path.join(cache, '.cacherequires')
# The cache is empty. No validation necessary. Just set up our
# requirements file.
# The cache is empty. Configure it.
if not os.listdir(cache):
print_line(b'cache', b'cache %s is empty; writing requirements: '
b'%s\n' % (cache, b' '.join(sorted(our_requirements))))
# We write a requirements file so future invocations know what the
# requirements are.
with open(requires_path, 'wb') as fh:
fh.write(b'\n'.join(sorted(our_requirements)))
@ -321,16 +348,40 @@ def main(args):
cache, b' '.join(sorted(wanted_requirements))))
missing = wanted_requirements - our_requirements
if missing:
# Allow requirements mismatch for uid/gid if and only if caches
# are untrusted. This allows cache behavior on Try to be
# reasonable. Otherwise, random tasks could "poison" cache
# usability by introducing uid/gid mismatches. For untrusted
# environments like Try, this is a perfectly reasonable thing to
# allow.
if missing and untrusted_caches and running_as_root and \
all(s.startswith(('uid=', 'gid=')) for s in missing):
print_line(b'cache',
b'cache %s uid/gid mismatch; this is acceptable '
b'because caches for this task are untrusted; '
b'changing ownership to facilitate cache use\n' %
cache)
chown_recursive(cache, user.pw_name, group.gr_name, uid, gid)
# And write out the updated reality.
with open(requires_path, 'wb') as fh:
fh.write(b'\n'.join(sorted(our_requirements)))
elif missing:
print('requirements for populated cache %s differ from '
'this task' % cache)
print('cache requirements: %s' % ' '.join(sorted(
wanted_requirements)))
print('our requirements: %s' % ' '.join(sorted(
our_requirements)))
if any(s.startswith(('uid=', 'gid=')) for s in missing):
print(CACHE_UID_GID_MISMATCH)
return 1
chown_recursive(cache, user.pw_name, group.gr_name, uid, gid)
# We don't need to adjust permissions here because the cache is
# associated with a uid/gid and the first task should have set
# a proper owner/group.
# The cache has content and no requirements file. This shouldn't
# happen because run-task should be the first thing that touches a


@ -740,6 +740,9 @@ def build_docker_worker_payload(config, task, task_def):
payload['env']['TASKCLUSTER_VOLUMES'] = ';'.join(
sorted(worker['volumes']))
if payload.get('cache') and skip_untrusted:
payload['env']['TASKCLUSTER_UNTRUSTED_CACHES'] = '1'
if features:
payload['features'] = features
if capabilities: