.. _healthreport_identifiers:

===========
Identifiers
===========

Firefox Health Report records some identifiers to keep track of clients
and uploaded documents.

Identifier Types
================

Document/Upload IDs
-------------------

A random UUID called the *Document ID* or *Upload ID* is generated when the FHR
client creates or uploads a new document.

When clients generate a new *Document ID*, they persist this ID to disk
**before** the upload attempt.

As part of the upload, the client sends all old *Document IDs* to the server
and asks the server to delete them. For well-behaved clients, the server
therefore holds a single record per client, and that record's *Document ID*
changes to a new random value whenever the client uploads a new document.

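The upload flow described above can be sketched roughly as follows. This is a
minimal Python illustration, not the actual FHR client code:
``DocumentIdStore``, ``upload_document``, the ``send`` callback, and the JSON
state file are all hypothetical names chosen for this example. It shows the two
properties the text relies on: the new ID is persisted before the upload is
attempted, and old IDs are forgotten only after the server has accepted the
request to delete them.

```python
import json
import uuid
from pathlib import Path


class DocumentIdStore:
    """Hypothetical on-disk list of Document IDs that may exist on the server."""

    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        if not self.path.exists():
            return []
        return json.loads(self.path.read_text())

    def save(self, ids):
        self.path.write_text(json.dumps(ids))


def upload_document(store, payload, send):
    """Upload payload under a fresh Document ID and ask to delete old ones.

    `send(new_id, payload, delete_ids)` is an assumed transport callback
    that returns True when the server accepted the request.
    """
    old_ids = store.load()
    new_id = str(uuid.uuid4())

    # Persist the new ID *before* the upload attempt, so a crash mid-upload
    # cannot leave a document on the server that the client has forgotten.
    store.save(old_ids + [new_id])

    if send(new_id, payload, old_ids):
        # Upload and deletions were accepted: only the new document remains.
        store.save([new_id])
        return new_id

    # Upload failed: keep every ID so the next attempt asks to delete them.
    return None
```

If the process dies between ``store.save`` and ``send``, the next upload still
lists the unused ID for deletion, which is harmless. Forgetting an ID that did
reach the server is the failure this ordering is meant to avoid.
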
Client IDs
----------

A *Client ID* is an identifier that **attempts** to uniquely identify an
individual FHR client. Please note the emphasis on *attempts* in that last
sentence: *Client IDs* do not guarantee uniqueness.

The *Client ID* is generated when the client first runs or as needed.

The *Client ID* is transferred to the server as part of every upload. The
server is thus able to associate multiple document uploads with a single
*Client ID*.

Client ID Versions
^^^^^^^^^^^^^^^^^^

The semantics for how a *Client ID* is generated are versioned.

Version 1
   The *Client ID* is a randomly-generated UUID.

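As an illustration of a persistent, versioned *Client ID*, the following
minimal Python sketch generates a version 1 ID on first use, stores it, and
attaches it to every upload. The function names, the JSON state file, and the
field names (``clientID``, ``clientIDVersion``, ``documentID``) are assumptions
made for this example and are not taken from the FHR implementation.

```python
import json
import uuid
from pathlib import Path

CLIENT_ID_VERSION = 1  # version 1: a randomly generated UUID


def get_client_id(state_path):
    """Return the persisted Client ID record, generating one if missing."""
    path = Path(state_path)
    if path.exists():
        state = json.loads(path.read_text())
        if "clientID" in state:
            return state
    # First run (or state lost): generate and persist a new ID.
    state = {"clientID": str(uuid.uuid4()), "clientIDVersion": CLIENT_ID_VERSION}
    path.write_text(json.dumps(state))
    return state


def build_upload(state_path, document_id, data):
    """Attach the Client ID to an upload so the server can group documents."""
    state = get_client_id(state_path)
    return {
        "clientID": state["clientID"],
        "clientIDVersion": state["clientIDVersion"],
        "documentID": document_id,
        "data": data,
    }
```

Because the state file travels with the profile in this sketch, copying the
profile copies the ID too, which is one reason a *Client ID* only attempts to
be unique.
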
History of Identifiers
======================

In the beginning, there were just *Document IDs*. The thinking was that clients
would clean up after themselves and leave at most one active document on the
server.

Unfortunately, this did not work out. Brute-force analysis to deduplicate
records on the server revealed a number of interesting patterns.

Orphaning
   Clients would upload a new payload while not deleting the old payload.

Divergent records
   Records would share data up to a certain date and then the data would
   almost completely diverge. This appears to be indicative of profile
   copying.

Rollback
   Records would share data up to a certain date. Each record in this set
   would contain data for a day or two but no extra data. This could be
   explained by filesystem rollback on the client.

A significant percentage of the records on the server belonged to
misbehaving clients. Identifying these records was extremely resource
intensive and error-prone. These records were undermining the ability
to use Firefox Health Report data.

Thus, the *Client ID* was born. The intent of the *Client ID* was to
uniquely identify clients so that the extreme effort required to deduplicate
server data, and the questionable reliability of the result, would become
problems of the past.

The *Client ID* was originally a randomly-generated UUID (version 1). This
allowed detection of orphaning and rollback. However, these version 1
*Client IDs* could still end up in use on multiple profiles and
machines if the profile was copied.

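To make the benefit concrete, here is a rough Python sketch of how a server
could use the *Client ID* to find clients with more than one stored document,
replacing the brute-force comparison of record contents. The function name and
the record keys (``clientID``, ``documentID``) are hypothetical and only serve
the illustration.

```python
from collections import defaultdict


def find_multi_document_clients(documents):
    """Group server documents by Client ID and report IDs with more than one.

    `documents` is an iterable of dicts with the assumed keys "clientID"
    and "documentID".
    """
    by_client = defaultdict(list)
    for doc in documents:
        by_client[doc["clientID"]].append(doc["documentID"])
    # A well-behaved client leaves a single document; any extras are
    # candidates for orphaned or rolled-back records.
    return {cid: ids for cid, ids in by_client.items() if len(ids) > 1}
```

A copied profile shares its *Client ID* with the original, so it would show up
in this result as well. With version 1 IDs alone, the grouping cannot tell
orphaning apart from profile copying.
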