The separation between BridgeSnapshotReader, BridgeDescriptorParser,
and SanitizedBridgesWriter doesn't make much sense anymore:
- BridgeSnapshotReader contains nothing but a constructor of more
than 200 lines of code,
- BridgeDescriptorParser actually only determines the descriptor
type, and
- SanitizedBridgesWriter performs the actual parsing and obfuscation.
There are better ways to structure this code. The first step in that
direction is to remove clutter by moving the code to read bridge
snapshots to SanitizedBridgesWriter and deleting the other two
classes.
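A rough sketch of the intended structure, using method names that are
only illustrative and not taken from the actual code:

    // Illustrative sketch only: after this change, SanitizedBridgesWriter
    // reads bridge snapshots itself; readBridgeSnapshots() and the
    // comments below describe assumed responsibilities, not the real API.
    public class SanitizedBridgesWriter {

      public void startProcessing() {
        this.readBridgeSnapshots();
      }

      private void readBridgeSnapshots() {
        // Formerly BridgeSnapshotReader's 200-line constructor: read the
        // bridge snapshot tarballs, determine each contained descriptor's
        // type (formerly BridgeDescriptorParser), and sanitize it.
      }
    }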
Part of #20542.
The indexer did not handle a (mostly theoretical) edge case of a file
being moved away and then moved back shortly after. In such a case the
file should not be marked for deletion anymore and it should be
included in the index again. That's what this commit does.
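A minimal sketch of that behavior, assuming a map of files marked for
deletion next to the index; the field and method names are
illustrative, not the indexer's actual ones:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class IndexerSketch {

      /** Files that disappeared and are scheduled for deletion from the
       * index, mapped to the time when they were marked. */
      private final Map<String, Long> markedForDeletion = new HashMap<>();

      /** Currently indexed files. */
      private final Map<String, Long> index = new HashMap<>();

      void update(Set<String> currentlyExistingFiles, long now) {
        for (String file : currentlyExistingFiles) {
          /* If the file was moved away earlier and has now been moved
           * back, forget the deletion mark and index it again. */
          this.markedForDeletion.remove(file);
          this.index.put(file, now);
        }
      }
    }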
The other minor changes to unit tests are just cosmetic.
Fixes #34030.
One of the previous changes to cleaning up directories was that
empty directories were deleted. This was necessary, because otherwise
there would be a growing number of directories as files get deleted
after reaching an age of seven weeks.
However, this change should not have included deleting the cleaned-up
directory itself. In practice, this will not happen, but in tests it's
certainly possible that a directory is empty and then gets deleted,
which leads to all sorts of problems.
The fix is to limit deleting empty directories to subdirectories.
That's what this commit does.
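A minimal sketch of the fix using only standard java.nio.file calls;
deleteEmptySubdirectories() is an illustrative name, not the actual
method:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Comparator;
    import java.util.stream.Stream;

    public class CleanupSketch {

      /** Deletes empty directories below the given directory, but never
       * the given directory itself. */
      static void deleteEmptySubdirectories(Path cleanedUpDirectory)
          throws IOException {
        try (Stream<Path> paths = Files.walk(cleanedUpDirectory)) {
          paths.sorted(Comparator.reverseOrder()) /* deepest paths first */
              .filter(Files::isDirectory)
              .filter(dir -> !dir.equals(cleanedUpDirectory))
              .forEach(CleanupSketch::deleteIfEmpty);
        }
      }

      private static void deleteIfEmpty(Path directory) {
        try {
          boolean isEmpty;
          try (Stream<Path> entries = Files.list(directory)) {
            isEmpty = !entries.findAny().isPresent();
          }
          if (isEmpty) {
            Files.delete(directory);
          }
        } catch (IOException e) {
          /* Skip directories that cannot be listed or deleted. */
        }
      }
    }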
These lines were added by proposal 313 and are usually not included
by bridges. But apparently some bridges include them anyway, probably
bridges that were previously configured as non-bridge relays. We
should retain them just like we retain other statistics lines.
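A hedged sketch of how such lines could be retained while sanitizing;
the keywords listed here (assumed to be the relay IPv6 statistics
lines from proposal 313) and the class and method names are
assumptions of this sketch, not taken from the actual code:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class RetainedStatisticsSketch {

      /** Keywords of statistics lines to copy through unchanged; assumed
       * to be the relay IPv6 statistics lines from proposal 313. */
      private static final Set<String> RETAINED_KEYWORDS = new HashSet<>(
          Arrays.asList("ipv6-read-history", "ipv6-write-history",
              "ipv6-conn-bi-direct"));

      /** Returns whether the given descriptor line should be retained in
       * the sanitized descriptor, based on its first keyword. */
      static boolean retainLine(String line) {
        return RETAINED_KEYWORDS.contains(line.split(" ", 2)[0]);
      }
    }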
The three recently added modules to archive Snowflake statistics,
bridge pool assignments, and BridgeDB metrics have in common that they
process all input files regardless of whether they have already
processed them before.
The problem is that the input files processed by these modules are
either never removed (Snowflake statistics) or only removed manually
by the operator (bridge pool assignments and BridgeDB metrics).
The effect is that non-recent BridgeDB metrics and bridge pool
assignments are placed back in the indexed/recent/ directory in the
next execution after they have been deleted for being older than 72
hours.
The same would happen with Snowflake statistics after the operator
removes them from the out/ directory.
The fix is to use a state file containing the names of previously
processed files and to only process files not found in there. This is
the same approach as taken for bridge descriptor tarballs.
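A minimal sketch of such a state file, assuming one processed file
name per line; ProcessedFilesState and its methods are illustrative
names only:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.SortedSet;
    import java.util.TreeSet;

    public class ProcessedFilesState {

      private final Path stateFile;

      private final SortedSet<String> processedFileNames = new TreeSet<>();

      ProcessedFilesState(Path stateFile) throws IOException {
        this.stateFile = stateFile;
        if (Files.exists(stateFile)) {
          this.processedFileNames.addAll(
              Files.readAllLines(stateFile, StandardCharsets.UTF_8));
        }
      }

      /** Returns whether the given input file has not been processed yet. */
      boolean isNew(Path inputFile) {
        return !this.processedFileNames.contains(
            inputFile.getFileName().toString());
      }

      /** Remembers the given input file as processed and rewrites the
       * state file. */
      void markProcessed(Path inputFile) throws IOException {
        this.processedFileNames.add(inputFile.getFileName().toString());
        Files.write(this.stateFile, this.processedFileNames,
            StandardCharsets.UTF_8);
      }
    }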
Web servers typically provide us with the last 14 days of request
logs. We shouldn't process the whole 14 days over and over. Instead,
we should only process new log files and any other log files
containing log lines from newly written dates.
In some cases web servers stop serving a given virtual host or stop
acting as a web server at all. However, in these cases we're left with
14 days of logs per virtual host. Ideally, these logs would get
cleaned up, but until that's the case, we should at least not
reprocess these files over and over.
In order to avoid reprocessing webstats files, we need a new state
file that records which log dates are contained in which input files.
We use that state file to determine which of the previously processed
webstats files to re-process, so that we can write complete daily
logs.
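A hedged sketch of such a state, mapping each processed input file to
the log dates it contains; class and method names are illustrative,
not the actual ones:

    import java.time.LocalDate;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class WebstatsDatesState {

      /** Log dates contained in each previously processed input file. */
      private final Map<String, Set<LocalDate>> containedDates
          = new HashMap<>();

      /** Remembers which log dates the given input file contains. */
      void rememberDates(String inputFileName, Set<LocalDate> dates) {
        this.containedDates
            .computeIfAbsent(inputFileName, fileName -> new HashSet<>())
            .addAll(dates);
      }

      /** Returns the previously processed files containing any of the
       * newly written dates; these need to be re-processed to write
       * complete daily logs. */
      Set<String> filesToReprocess(Set<LocalDate> newlyWrittenDates) {
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Set<LocalDate>> entry
            : this.containedDates.entrySet()) {
          for (LocalDate date : entry.getValue()) {
            if (newlyWrittenDates.contains(date)) {
              result.add(entry.getKey());
              break;
            }
          }
        }
        return result;
      }
    }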
The only functionality contained in metrics-lib's internal package is
file (de-)compression, which in turn uses a third-party library that
we're using anyway. This is a weak reason for depending on our own
library for this functionality. Removing this dependency will make it
easier to make changes to our library in the future.
The new FileType class is based on a copy of the same enum type in
metrics-lib, with @since tags and unused methods removed.
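A hedged sketch of what such a copied enum could look like, wrapping
(de-)compression streams from Apache Commons Compress; the exact
constants and methods of the actual class are not reproduced here:

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
    import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
    import org.apache.commons.compress.compressors.xz.XZCompressorInputStream;

    public enum FileType {

      BZ2, GZ, PLAIN, XZ;

      /** Wraps the given stream in a decompressing stream, if needed. */
      public InputStream inputStream(InputStream is) throws IOException {
        switch (this) {
          case BZ2:
            return new BZip2CompressorInputStream(is);
          case GZ:
            return new GzipCompressorInputStream(is);
          case XZ:
            return new XZCompressorInputStream(is);
          case PLAIN:
          default:
            return is;
        }
      }
    }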