When we switched from storing copies of byte[] to storing byte[]
references, offset, and length, getRawDescriptor() suddenly became
more expensive than before. (Before that, copying bytes in the first
place was always expensive.) If an application only calls
Descriptor#getRawDescriptorBytes() to learn the array length, there's
now a more efficient way to provide that information.
This change simplifies the DescriptorReader interface and allows for
shorter and more concise application code. The result is that
DescriptorReader returns Descriptor instances rather than
DescriptorFile instances containing Descriptors and accepts a maximum
queue size in Descriptors, DescriptorFile is deprecated, and
Descriptor contains a File reference to the descriptor file.
Implements #22141.
Related to this change, this commit introduces a new
UnparseableDescriptor to be returned by DescriptorParser and
DescriptorReader if a descriptor cannot be parsed, as opposed to
throwing a DescriptorParseException or skipping the entire descriptor
file (fixes#22139), respectively.
Also related to this change, DescriptorParser now returns an Iterable
instead of a List, which prepares parsing large descriptor files
descriptor by descriptor (will be tackled in #20395).
Prior to this commit we read raw descriptor bytes from disk, split
them into serveral byte[] for each contained descriptor, and stored
those copies together with descriptors. We further copied descriptor
parts, like signatures or status entries, and stored those copies as
well.
Overall, we temporarily required up to 3 times the size of descriptor
files just to store raw descriptor contents: 1) the entire descriptor
file read to memory, 2) copies of all contained descriptors, and 3)
copies of contained descriptor parts. After moving on to the next
descriptor file, 1) was freed, but 2) and 3) remained in memory. This
was rather wasteful.
With this commit we store raw descriptors as reference to the byte[]
containing the entire descriptor file plus offset and length of the
part containing one descriptor. Similarly we store raw descriptor
parts as a reference to the full descriptor plus offset and length of
the descriptor part. This saves a lot of memory, and it avoids
unnecessary array copying.
This change is also a step towards not storing raw descriptor contents
in memory at all, but instead leaving contents on disk and accessing
parts as needed. However, this commit does not take that step yet.
The original purpose of this commit was to prepare switching from the
platform's default charset to UTF-8 for #21932. The idea was to
reduce access to DescriptorImpl#rawDescriptorBytes and add all methods
working on those bytes, including converting them to a String, to
DescriptorImpl. This commit achieves this purpose by preparing that
switch, yet it does not take that step, either. Switching to UTF-8 is
midly backward-incompatible, so it'll have to wait until 2.0.0.
However, switching will be much easier based on the changes in this
commit.
Many of these changes in this commit are interdependent which makes it
difficult to split up this commit with reasonable effort. Still, in
order to facilitate reviews, here is an explanation of changes made in
this commit from top to bottom:
Move all code for processing raw descriptor bytes from a) detecting
the descriptor type, b) finding descriptor starts and ends, up to c)
invoking the right DescriptorImpl subclass constructors from
DescriptorImpl and its subclasses over to DescriptorParserImpl.
Include offset and limit in the constructors of DescriptorImpl and
most of its subclasses.
Refer to directory and network status parts in RelayDirectoryImpl and
NetworkStatusImpl and its subclasses by offset and length rather than
passing copies of raw descriptors.
Provide two overloaded methods DescriptorImpl#newScanner() that
internally handle the byte[]-to-String conversion rather than leaving
this task to all DescriptorImpl subclasses.
In DescriptorImpl, rather than storing a copy of raw descriptor bytes
per descriptor, store a reference to a potentially larger byte[],
containing all descriptors read from a given file, together with
offset and length.
Provide various methods in DescriptorImpl that provide access to raw
descriptor bytes and that internally handle issues like unified
character encoding.
Include an XXX21932 tag in all places where byte[] is currently
converted to String using the platform's default charset.
Update existing methods in DescriptorImpl to only access
rawDescriptorBytes within offset and offset + length.
In classes referenced from DescriptorImpl subclasses, like
DirSourceEntryImpl and NetworkStatusEntryImpl, rather than storing a
copy of raw descriptor bytes, store a reference to the parent
DescriptorImpl instance together with offset and length.
Change raw descriptor bytes in ExitListEntryImpl into a String,
because the byte[] we stored there was never read from disk but
generated by ourselves using String#getBytes() using the platform's
default charset. We also never used raw bytes in ExitListEntryImpl
anyway. Admittedly, we could use offset and length there, too, but
the amount of saved memory is likely not worth the necessary code
changes.
Remove redundant zero-length checks from DescriptorImpl subclasses
including ExitListImpl, NetworkStatusImpl, and RelayDirectoryImpl.
These checks are redundant, because we already performed the same
checks in DescriptorImpl#countKeys().
Move commonly used helper methods for finding the first index of a
keyword or splitting descriptory by keyword from DescriptorImpl
subclasses, like NetworkStatusImpl and RelayDirectoryImpl, to
DescriptorImpl.
In test classes, replace the numerous invocations of DescriptorImpl
subclass constructors with local buildSomething() methods, so that
future changes to constructor signatures won't produce a diff as long
as this one.
The main intention behind this change is to reduce the number of
places in the code where byte[] is converted to String. But another
reason is to reduce code duplication, which would have been sufficient
to make this change.
According to the specification it's valid to add extra arguments to
descriptor lines unless they are tagged with "[No extra arguments]".
This is not the case for any of the statistics-related lines in
extra-info descriptors, so we should allow extra arguments there.
Fixes#21934.
OnionPerf adds six new key-value pairs to the .tpf format that
Torperf/CollecTor did not produce: ENDPOINTLOCAL, ENDPOINTPROXY,
ENDPOINTREMOTE, HOSTNAMELOCAL, HOSTNAMEREMOTE, and SOURCEADDRESS.
We should add support for these keys to metrics-lib, so that we can
start using their values.
Implements #22122.