These lines were added by proposal 313 and are usually not included by
bridges. Apparently some bridges include them anyway, probably bridges
that were previously configured as non-bridge relays. We should retain
them just like we retain other statistics lines.
The three recently added modules to archive Snowflake statistics,
bridge pool assignments, and BridgeDB metrics have in common that they
process all input files regardless of whether they have already
processed them.
The problem is that the input files processed by these modules are
either never removed (Snowflake statistics) or only removed manually
by the operator (bridge pool assignments and BridgeDB statistics).
The effect is that non-recent BridgeDB metrics and bridge pool
assignments are placed in the indexed/recent/ directory again in the
next execution after they were deleted for being older than 72 hours.
The same would happen with Snowflake statistics after the operator
removes them from the out/ directory.
The fix is to use a state file containing file names of previously
processed files and only process a file not found in there. This is
the same approach as taken for bridge descriptor tarballs.
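A minimal sketch of the state file approach, assuming one file name
per line in the state file. The class ProcessedFilesState and its
method names are hypothetical and only illustrate the idea; they are
not taken from the actual code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper: remembers names of previously processed files. */
class ProcessedFilesState {

  private final Path stateFile;

  private final SortedSet<String> processed = new TreeSet<>();

  /** Reads the state file, if it exists, with one file name per line. */
  ProcessedFilesState(Path stateFile) throws IOException {
    this.stateFile = stateFile;
    if (Files.exists(stateFile)) {
      for (String line
          : Files.readAllLines(stateFile, StandardCharsets.UTF_8)) {
        if (!line.trim().isEmpty()) {
          this.processed.add(line.trim());
        }
      }
    }
  }

  /** Returns only those input files not listed in the state file. */
  List<Path> selectUnprocessed(List<Path> inputFiles) {
    List<Path> result = new ArrayList<>();
    for (Path inputFile : inputFiles) {
      if (!this.processed.contains(inputFile.getFileName().toString())) {
        result.add(inputFile);
      }
    }
    return result;
  }

  /** Marks a file as processed; call save() to persist this. */
  void markProcessed(Path inputFile) {
    this.processed.add(inputFile.getFileName().toString());
  }

  /** Writes all known file names back to the state file. */
  void save() throws IOException {
    Files.write(this.stateFile, this.processed, StandardCharsets.UTF_8);
  }
}
```

A module would call selectUnprocessed() on its input directory
listing, process the returned files, mark each as processed, and save
the state at the end of the run.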
Web servers typically provide us with the last 14 days of request
logs. We shouldn't process the whole 14 days over and over. Instead we
should only process new log files and any other log files containing
log lines from newly written dates.
In some cases web servers stop serving a given virtual host or stop
acting as a web server at all. In these cases we're still left with
14 days of logs per virtual host. Ideally, these logs would get
cleaned up, but until that's the case, we should at least not
reprocess these files over and over.
In order to avoid reprocessing webstats files, we need a new state
file with log dates contained in given input files. We use that state
file to determine which of the previously processed webstats files to
re-process, so that we can write complete daily logs.
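A rough sketch of how such a state could be used, assuming it maps
each input file name to the set of log dates found in that file, and
assuming "newly written dates" means all dates contained in files
that are not yet in the state. The class and method names are
hypothetical.

```java
import java.time.LocalDate;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper: selects webstats files to re-process. */
class WebstatsReprocessing {

  /**
   * Returns names of previously processed files that contain at least
   * one log date also found in the new files. All other previously
   * processed files can be skipped.
   */
  static SortedSet<String> filesToReprocess(
      Map<String, Set<LocalDate>> previouslyProcessed,
      Map<String, Set<LocalDate>> newFiles) {
    /* Collect all dates for which we're going to write daily logs. */
    Set<LocalDate> newlyWrittenDates = new HashSet<>();
    for (Set<LocalDate> dates : newFiles.values()) {
      newlyWrittenDates.addAll(dates);
    }
    SortedSet<String> result = new TreeSet<>();
    for (Map.Entry<String, Set<LocalDate>> e
        : previouslyProcessed.entrySet()) {
      if (newFiles.containsKey(e.getKey())) {
        continue; /* Already going to be processed as new file. */
      }
      if (!Collections.disjoint(e.getValue(), newlyWrittenDates)) {
        result.add(e.getKey());
      }
    }
    return result;
  }
}
```

Files that share no date with any new file are left alone, which also
covers logs from web servers that stopped serving a virtual host.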
The only functionality contained in metrics-lib's internal package is
file (de-)compression, which in turn uses a third-party library that
we're using anyway. This is a weak reason for depending on our own
library for this functionality. Removing this dependency will make it
easier to make changes to our library in the future.
The new FileType class is based on a copy of the same enum type in
metrics-lib without @since tags and without methods that we don't use.
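For illustration only, a reduced enum of this kind could look as
follows. This sketch swaps in java.util.zip from the standard library
instead of the third-party library mentioned above, supports only
gzip and plain files, and uses the made-up name SketchFileType; it is
not the actual FileType class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative stand-in, not the real FileType enum. */
enum SketchFileType {

  GZ,
  PLAIN;

  /** Compresses the given bytes, or copies them for PLAIN. */
  byte[] compress(byte[] raw) throws IOException {
    if (this == PLAIN) {
      return raw.clone();
    }
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (OutputStream out = new GZIPOutputStream(baos)) {
      out.write(raw);
    }
    return baos.toByteArray();
  }

  /** Decompresses the given bytes, or copies them for PLAIN. */
  byte[] decompress(byte[] stored) throws IOException {
    if (this == PLAIN) {
      return stored.clone();
    }
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (InputStream in = new GZIPInputStream(
        new ByteArrayInputStream(stored))) {
      byte[] buffer = new byte[4096];
      int read;
      while ((read = in.read(buffer)) != -1) {
        baos.write(buffer, 0, read);
      }
    }
    return baos.toByteArray();
  }
}
```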
We were using the same path for BridgeDB metrics in out/ and recent/,
and file names didn't contain the "-bridgedb-metrics" suffix that we
intended to add.
We're now using paths generated by BridgedbMetricsPersistence.
Also update create-tarballs.sh to create BridgeDB metrics tarballs.
Still part of #19332.