Extra-info descriptors contain lots of comma-separated key=value lists
that we store in SortedMap instances. But those occupy a lot of memory,
and it's not certain that we'll ever want to use the contained keys or
values.
New approach: when parsing a descriptor, use regular expressions to check
if lines are valid, and delay parsing into maps until needed.
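The validate-now, parse-later idea above can be sketched as follows.
This is a minimal illustration, not the actual implementation: the class
name LazyKeyValueList, the regular expression, and the restriction to
numeric keys and values are all assumptions made for the example.

```java
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Hypothetical holder for one comma-separated key=value list. */
class LazyKeyValueList {

  // Assumed grammar: zero or more "key=value" pairs with numeric keys
  // and values. The digit limits guarantee that the later conversion to
  // Integer and Long cannot overflow.
  private static final Pattern VALID = Pattern.compile(
      "(\\d{1,9}=\\d{1,18}(,\\d{1,9}=\\d{1,18})*)?");

  // Only the raw string is kept after validation.
  private final String raw;

  LazyKeyValueList(String raw) {
    if (!VALID.matcher(raw).matches()) {
      throw new IllegalArgumentException("Invalid list: " + raw);
    }
    this.raw = raw;
  }

  /** Builds the map only when a caller actually asks for it. */
  SortedMap<Integer, Long> toMap() {
    SortedMap<Integer, Long> result = new TreeMap<Integer, Long>();
    if (!raw.isEmpty()) {
      for (String pair : raw.split(",")) {
        int eq = pair.indexOf('=');
        result.put(Integer.valueOf(pair.substring(0, eq)),
            Long.valueOf(pair.substring(eq + 1)));
      }
    }
    return result;
  }
}
```

The trade-off is that every call to toMap() parses the string again, which
is cheap if most lists are never requested at all.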
If we can easily determine the number of List or Set elements, we might as
well store their contents in arrays and convert those to List or Set
instances when requested. This can save us some memory and doesn't cost
much performance.
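A small sketch of the array-backed variant, assuming a comma-separated
line of string elements. The class name ArrayBackedList is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical holder that keeps elements in an array until requested. */
class ArrayBackedList {

  private final String[] elements;

  ArrayBackedList(String line) {
    // split() sizes the array for us; an empty line means no elements.
    this.elements = line.isEmpty() ? new String[0] : line.split(",");
  }

  /** Converts to a List on demand; the copy keeps the array private. */
  List<String> getElements() {
    return new ArrayList<String>(Arrays.asList(elements));
  }
}
```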
There have been at least two relays including an additional SP after their
nickname in server and extra-info descriptors. The spec stays vague about
whether this is allowed or not, but the directory authorities seem to
accept these descriptors just fine. We should also accept these
descriptors.
We should probably accept [SP|TAB]+ in more places. But right now we're
losing data by discarding these descriptors. Let's do the quick fix now
and the potentially cleaner fix later.
Fixes #12403.
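For illustration, here is a sketch of the more general [SP|TAB]+ handling
mentioned above, applied to an extra-info line. It is not the quick fix
itself; the class name and the assumed three-token line layout
("extra-info", nickname, fingerprint) are assumptions for the example.

```java
/** Hypothetical tokenizer tolerating runs of SP or TAB between tokens. */
class ExtraInfoLineParser {

  /** Returns {nickname, fingerprint} from an extra-info line. */
  static String[] parse(String line) {
    // One or more spaces or tabs count as a single separator.
    String[] parts = line.split("[ \\t]+");
    if (parts.length != 3 || !"extra-info".equals(parts[0])) {
      throw new IllegalArgumentException("Illegal line: " + line);
    }
    return new String[] { parts[1], parts[2] };
  }
}
```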
DateFormat is not thread-safe, and creating a new instance every time we
need one only wastes CPU time. Make sure that there's a single instance
per thread and per format pattern that the thread can reuse whenever it
needs one.
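One way to get a single instance per thread and pattern is a ThreadLocal
holding a small map, sketched below. The class name, the UTC time zone,
and the US locale are assumptions for the example.

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

/** Hypothetical cache of DateFormat instances, one per thread and pattern. */
class ThreadLocalDateFormats {

  // Each thread gets its own map, so no instance is ever shared.
  private static final ThreadLocal<Map<String, DateFormat>> FORMATS =
      new ThreadLocal<Map<String, DateFormat>>() {
        @Override
        protected Map<String, DateFormat> initialValue() {
          return new HashMap<String, DateFormat>();
        }
      };

  static DateFormat get(String pattern) {
    Map<String, DateFormat> perThread = FORMATS.get();
    DateFormat format = perThread.get(pattern);
    if (format == null) {
      // Created at most once per thread and pattern.
      format = new SimpleDateFormat(pattern, Locale.US);
      format.setLenient(false);
      format.setTimeZone(TimeZone.getTimeZone("UTC"));
      perThread.put(pattern, format);
    }
    return format;
  }
}
```

Callers use ThreadLocalDateFormats.get("yyyy-MM-dd HH:mm:ss") wherever
they would otherwise create a new SimpleDateFormat.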
We were storing bandwidth histories in TreeMap<Long, Long> instances, with
keys being times in millis and values being bandwidth values. This showed up in
profiles. It's far more (memory-)efficient to store bandwidth values in a
long[] and put together the TreeMap when the caller requests it. And if
the bandwidth history is evaluated exactly once, there should not even be
a CPU overhead.
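A sketch of the compact representation, assuming that all values in a
history are spaced by one fixed interval and that map keys are interval
end times. The class and field names are hypothetical.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical bandwidth history backed by a long[] instead of a map. */
class CompactBandwidthHistory {

  private final long lastIntervalEndMillis;
  private final long intervalMillis;
  private final long[] values; // oldest value first

  CompactBandwidthHistory(long lastIntervalEndMillis, long intervalMillis,
      long[] values) {
    this.lastIntervalEndMillis = lastIntervalEndMillis;
    this.intervalMillis = intervalMillis;
    this.values = values.clone();
  }

  /** Assembles the map only when the caller requests it. */
  SortedMap<Long, Long> getBandwidthValues() {
    SortedMap<Long, Long> result = new TreeMap<Long, Long>();
    // Walk forward from the end of the oldest interval.
    long intervalEnd = lastIntervalEndMillis
        - (values.length - 1) * intervalMillis;
    for (long value : values) {
      result.put(intervalEnd, value);
      intervalEnd += intervalMillis;
    }
    return result;
  }
}
```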
By default, the descriptor reader puts up to 100 parsed descriptor files
in a queue in order to hand them out as quickly as possible. But if
descriptor files contain hundreds or even thousands of descriptors, that
default may be too high. Add a new method to make it configurable.
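The idea can be sketched with a bounded BlockingQueue whose capacity is
set before reading starts. The class and method names below are
hypothetical and do not claim to match the method added to the reader.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical bounded queue with a configurable capacity. */
class BoundedDescriptorQueue<T> {

  private int capacity = 100; // default mentioned above
  private BlockingQueue<T> queue;

  /** Must be called before start(); the capacity is fixed afterwards. */
  synchronized void setCapacity(int capacity) {
    if (queue != null) {
      throw new IllegalStateException("Queue was already created.");
    }
    if (capacity < 1) {
      throw new IllegalArgumentException("Capacity must be positive.");
    }
    this.capacity = capacity;
  }

  /** Creates the queue on first use; put() blocks once it is full. */
  synchronized BlockingQueue<T> start() {
    if (queue == null) {
      queue = new ArrayBlockingQueue<T>(capacity);
    }
    return queue;
  }
}
```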
When we're parsing a descriptor file with potentially more than one
descriptor in it, we're converting file contents to String to be able to
search for descriptor beginnings using String methods. But we're not
passing a character encoding, leaving it up to Java to guess. What we
should do is tell it to use "US-ASCII" as encoding, which is sufficient to
find keywords marking the beginning of a new descriptor.
Fixes #11821.
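A sketch of searching for descriptor beginnings with an explicit
encoding. The class name, method name, and the assumption that a keyword
is followed by a space are made up for the example.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that locates descriptor beginnings in raw bytes. */
class DescriptorSplitter {

  private static final Charset ASCII = Charset.forName("US-ASCII");

  /** Returns offsets of lines starting with the given keyword. */
  static List<Integer> findStarts(byte[] rawBytes, String keyword) {
    // Decode explicitly instead of relying on the platform default.
    String ascii = new String(rawBytes, ASCII);
    List<Integer> starts = new ArrayList<Integer>();
    String token = keyword + " ";
    if (ascii.startsWith(token)) {
      starts.add(0);
    }
    String needle = "\n" + token;
    int from = ascii.indexOf(needle);
    while (from >= 0) {
      starts.add(from + 1); // skip the newline itself
      from = ascii.indexOf(needle, from + 1);
    }
    return starts;
  }
}
```

The decoded string is only used for searching; the original bytes
should still be what gets handed to the descriptor parser.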
The local lib directory is not used anymore and the respective references
were removed. The Java dependencies are now specified in build.xml
and taken from their installed locations.
In addition to git, openjdk-6-jdk, and ant, the following Java packages
have to be installed:
- libcommons-codec-java
- libcommons-compress-java
- junit4
Minor tweaks by Karsten Loesing.
Milliseconds are simply ignored, because SimpleDateFormat only looks at
"yyyy-MM-dd HH:mm:ss" and ignores everything after that.
Related to #9286 where we discovered that some relays include milliseconds
in their descriptors.
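The behavior described above can be reproduced with a few lines. The
class name and the choice of UTC and the US locale are assumptions for
the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical demo of parsing timestamps with trailing milliseconds. */
class TrailingMillisDemo {

  static long parseMillis(String timestamp) {
    SimpleDateFormat format =
        new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
    format.setLenient(false);
    format.setTimeZone(TimeZone.getTimeZone("UTC"));
    try {
      // parse(String) stops after the pattern has been matched, so a
      // trailing ".123" is never looked at.
      return format.parse(timestamp).getTime();
    } catch (ParseException e) {
      throw new IllegalArgumentException("Illegal timestamp: " + timestamp);
    }
  }
}
```

With this sketch, "1970-01-01 00:00:01" and "1970-01-01 00:00:01.999"
parse to the same value.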
For example, see "other" entry in:
exit-kibibytes-read 80=505190490,182=25102395,443=61873906,
6881=47999666,8989=8657674,17173=7910494,21762=9138992,
45682=5154543,50500=6086469,51413=62394452,other=2282907805
Sanitized bridge descriptors contain transport lines with just the
transport name. However, there are now relays including unsanitized
transport lines, most likely because of a configuration problem. Don't
reject the entire descriptor when encountering those lines.
So far, the only way to prevent files from being parsed repeatedly in
distinct runs was to specify a history file that only metrics-lib was
supposed to read and write. However, some applications may want to
specify the list of files to exclude themselves, or they may want to learn
which files have been excluded and which have been parsed. These
applications shouldn't be forced to mess with the history file.
Add three methods to the descriptor reader for these applications. They
should also play nicely together with the history file approach.
AFAIK, stem has methods with the same purpose but slightly different
semantics.
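A rough sketch of how such bookkeeping could look, assuming files are
identified by path and last-modified time. The class name, method names,
and map types are hypothetical and not taken from the reader interface.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical bookkeeping of excluded and parsed files. */
class ExclusionTracker {

  private final SortedMap<String, Long> toExclude =
      new TreeMap<String, Long>();
  private final SortedMap<String, Long> excluded =
      new TreeMap<String, Long>();
  private final SortedMap<String, Long> parsed =
      new TreeMap<String, Long>();

  /** Application-provided files to skip: path to last-modified millis. */
  void setExcludedFiles(SortedMap<String, Long> files) {
    toExclude.clear();
    toExclude.putAll(files);
  }

  /** Decides whether to parse a file and records the decision. */
  boolean shouldParse(String path, long lastModified) {
    Long known = toExclude.get(path);
    if (known != null && known.longValue() == lastModified) {
      excluded.put(path, lastModified);
      return false;
    }
    // Unknown or modified files are parsed (again).
    parsed.put(path, lastModified);
    return true;
  }

  SortedMap<String, Long> getExcludedFiles() {
    return new TreeMap<String, Long>(excluded);
  }

  SortedMap<String, Long> getParsedFiles() {
    return new TreeMap<String, Long>(parsed);
  }
}
```

An application could store getParsedFiles() and getExcludedFiles()
after a run and pass their union to setExcludedFiles() in the next one.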