gecko-dev/services/healthreport/docs/identifiers.rst
Gregory Szorc dc4476a645 Bug 968419 - Store and submit a persistent health report identifier; r=rnewman, r=bsmedberg
Up to this point, Firefox Health Report has generated and submitted a
random UUID with each upload. Generated UUIDs were stored on the client.
During upload, the client asked the server to delete all old UUIDs.
Well-behaving clients thus left at most one record/ID on the server.

Unfortunately, clients in the wild have not been behaving properly. We
are seeing multiple documents on the server that appear to come from the
same client. Clients are uploading new records but failing to delete the
old ones. These old, undeleted "orphan" records are severely impacting
the ability to derive useful knowledge from FHR data because it is
difficult, resource intensive, and error prone to filter the records on
the server. This is undermining the ability for FHR data to be put to
good use.

This patch introduces a persistent client identifier. When the client is
initialized, it generates a random UUID. That UUID is persisted to the
profile and sent as part of every upload.

For privacy reasons, if a client opts out of data submission, the client
ID will be reset as soon as all remote data has been deleted.

We still issue and send upload IDs. They exist mostly for forensics
purposes so we may log client behavior and more accurately determine
what exactly misbehaving, orphan-producing clients are doing.

It is worth noting that this persistent client identifier will not solve
all problems of branching and orphaned records. For example, profile
copying will result in multiple clients sharing a client identifier. A
"client ID version" field has been added to facilitate an upgrade path
towards client IDs with different generation semantics.

2014-02-20 11:30:52 -08:00


.. _healthreport_identifiers:

===========
Identifiers
===========

Firefox Health Report records some identifiers to keep track of clients
and uploaded documents.

Identifier Types
================

Document/Upload IDs
-------------------

A random UUID called the *Document ID* or *Upload ID* is generated when the FHR
client creates or uploads a new document.

When clients generate a new *Document ID*, they persist this ID to disk
**before** the upload attempt.

As part of the upload, the client sends all old *Document IDs* to the server
and asks the server to delete them. In well-behaving clients, the server
has a single record for each client with a randomly-changing *Document ID*.
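
The upload flow described above can be sketched roughly as follows. This is
an illustrative sketch only, not the FHR implementation: ``DocumentIdStore``,
``upload`` and the ``send`` callback are hypothetical names, and an in-memory
list stands in for the on-disk state.

.. code-block:: python

   import uuid


   class DocumentIdStore:
       """Hypothetical stand-in for the on-disk list of Document IDs."""

       def __init__(self):
           self.ids = []

       def persist(self, doc_id):
           # A real client would flush this to disk before continuing.
           self.ids.append(doc_id)


   def upload(store, payload, send):
       """Upload ``payload`` under a fresh Document ID.

       ``send(new_id, payload, delete_ids)`` is an assumed transport
       callback returning True when the server accepted the upload and
       the requested deletions.
       """
       old_ids = list(store.ids)
       new_id = str(uuid.uuid4())
       # Persist before the upload attempt so the ID is not lost if the
       # client crashes or the request fails part-way through.
       store.persist(new_id)
       if send(new_id, payload, old_ids):
           # The server should now hold only the new document.
           store.ids = [new_id]
       return new_id

In this sketch a failed upload leaves both the old and the new IDs in the
store, so the next successful upload asks the server to delete all of them.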

Client IDs
----------

A *Client ID* is an identifier that **attempts** to uniquely identify an
individual FHR client. Please note the emphasis on *attempts* in that last
sentence: *Client IDs* do not guarantee uniqueness.

The *Client ID* is generated when the client first runs or as needed.

The *Client ID* is transferred to the server as part of every upload. The
server is thus able to affiliate multiple document uploads with a single
*Client ID*.

Client ID Versions
^^^^^^^^^^^^^^^^^^

The semantics for how a *Client ID* is generated are versioned.

Version 1
   The *Client ID* is a randomly-generated UUID.
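
A minimal sketch of version 1 generation, assuming the persisted client
state can be modelled as a dictionary. The function name and the
``clientID`` / ``clientIDVersion`` keys are illustrative assumptions, not
names confirmed by this document.

.. code-block:: python

   import uuid

   CLIENT_ID_VERSION = 1  # version 1: a randomly-generated UUID


   def load_or_create_client_id(state):
       """Return the Client ID held in ``state``, creating it if missing.

       ``state`` is a hypothetical dict standing in for persisted
       profile state.
       """
       if not state.get("clientID"):
           state["clientID"] = str(uuid.uuid4())
           # Record which generation semantics produced this ID.
           state["clientIDVersion"] = CLIENT_ID_VERSION
       return state["clientID"]

Because the ID is only created when none is stored, repeated calls against
the same state return the same value, which is what lets the server
associate several uploads with one client.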

History of Identifiers
======================

In the beginning, there were just *Document IDs*. The thinking was clients
would clean up after themselves and leave at most 1 active document on the
server.

Unfortunately, this did not work out. Using brute force analysis to
deduplicate records on the server, a number of interesting patterns emerged.

Orphaning
   Clients would upload a new payload while not deleting the old payload.

Divergent records
   Records would share data up to a certain date and then the data would
   almost completely diverge. This appears to be indicative of profile
   copying.

Rollback
   Records would share data up to a certain date. Each record in this set
   would contain data for a day or two but no extra data. This could be
   explained by filesystem rollback on the client.

A significant percentage of the records on the server belonged to
misbehaving clients. Identifying these records was extremely resource
intensive and error-prone. These records were undermining the ability
to use Firefox Health Report data.

Thus, the *Client ID* was born. The intent of the *Client ID* was to
uniquely identify clients so the extreme effort required and the
questionable reliability of deduplicating server data would become
problems of the past.

The *Client ID* was originally a randomly-generated UUID (version 1). This
allowed detection of orphaning and rollback. However, these version 1
*Client IDs* were still susceptible to use on multiple profiles and
machines if the profile was copied.
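
To illustrate why a shared *Client ID* makes such records easier to find,
the following sketch groups server documents by *Client ID* and reports
clients with more than one document. It is a simplified assumption of how
server-side analysis might look, with hypothetical names, and is not the
actual server code.

.. code-block:: python

   from collections import defaultdict


   def find_multi_document_clients(documents):
       """Return Client IDs that have more than one document.

       ``documents`` is an iterable of (client_id, document_id) pairs.
       """
       by_client = defaultdict(list)
       for client_id, document_id in documents:
           by_client[client_id].append(document_id)
       # More than one document under a single Client ID suggests
       # orphaning, rollback, or a copied profile.
       return {cid: ids for cid, ids in by_client.items() if len(ids) > 1}

Note that this grouping alone cannot tell an orphaned record from a copied
profile, since both appear as several documents under one *Client ID*.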