a4d184bf94
Prior to this commit we read raw descriptor bytes from disk, split them into several byte[], one per contained descriptor, and stored those copies together with descriptors. We further copied descriptor parts, like signatures or status entries, and stored those copies as well. Overall, we temporarily required up to 3 times the size of descriptor files just to store raw descriptor contents: 1) the entire descriptor file read to memory, 2) copies of all contained descriptors, and 3) copies of contained descriptor parts. After moving on to the next descriptor file, 1) was freed, but 2) and 3) remained in memory. This was rather wasteful.

With this commit we store raw descriptors as a reference to the byte[] containing the entire descriptor file plus offset and length of the part containing one descriptor. Similarly we store raw descriptor parts as a reference to the full descriptor plus offset and length of the descriptor part. This saves a lot of memory, and it avoids unnecessary array copying.

This change is also a step towards not storing raw descriptor contents in memory at all, but instead leaving contents on disk and accessing parts as needed. However, this commit does not take that step yet.

The original purpose of this commit was to prepare switching from the platform's default charset to UTF-8 for #21932. The idea was to reduce access to DescriptorImpl#rawDescriptorBytes and add all methods working on those bytes, including converting them to a String, to DescriptorImpl. This commit achieves this purpose by preparing that switch, yet it does not take that step, either. Switching to UTF-8 is mildly backward-incompatible, so it'll have to wait until 2.0.0. However, switching will be much easier based on the changes in this commit.

Many of the changes in this commit are interdependent, which makes it difficult to split up this commit with reasonable effort. Still, in order to facilitate reviews, here is an explanation of changes made in this commit from top to bottom:

- Move all code for processing raw descriptor bytes, from a) detecting the descriptor type and b) finding descriptor starts and ends up to c) invoking the right DescriptorImpl subclass constructors, from DescriptorImpl and its subclasses over to DescriptorParserImpl.

- Include offset and limit in the constructors of DescriptorImpl and most of its subclasses.

- Refer to directory and network status parts in RelayDirectoryImpl and NetworkStatusImpl and its subclasses by offset and length rather than passing copies of raw descriptors.

- Provide two overloaded methods DescriptorImpl#newScanner() that internally handle the byte[]-to-String conversion rather than leaving this task to all DescriptorImpl subclasses.

- In DescriptorImpl, rather than storing a copy of raw descriptor bytes per descriptor, store a reference to a potentially larger byte[], containing all descriptors read from a given file, together with offset and length.

- Provide various methods in DescriptorImpl that provide access to raw descriptor bytes and that internally handle issues like unified character encoding.

- Include an XXX21932 tag in all places where byte[] is currently converted to String using the platform's default charset.

- Update existing methods in DescriptorImpl to only access rawDescriptorBytes within offset and offset + length.

- In classes referenced from DescriptorImpl subclasses, like DirSourceEntryImpl and NetworkStatusEntryImpl, rather than storing a copy of raw descriptor bytes, store a reference to the parent DescriptorImpl instance together with offset and length.

- Change raw descriptor bytes in ExitListEntryImpl into a String, because the byte[] we stored there was never read from disk but generated by ourselves using String#getBytes() using the platform's default charset. We also never used raw bytes in ExitListEntryImpl anyway. Admittedly, we could use offset and length there, too, but the amount of saved memory is likely not worth the necessary code changes.

- Remove redundant zero-length checks from DescriptorImpl subclasses including ExitListImpl, NetworkStatusImpl, and RelayDirectoryImpl. These checks are redundant, because we already performed the same checks in DescriptorImpl#countKeys().

- Move commonly used helper methods for finding the first index of a keyword or splitting descriptors by keyword from DescriptorImpl subclasses, like NetworkStatusImpl and RelayDirectoryImpl, to DescriptorImpl.

- In test classes, replace the numerous invocations of DescriptorImpl subclass constructors with local buildSomething() methods, so that future changes to constructor signatures won't produce a diff as long as this one.
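
The reference-plus-offset-and-length idea described above can be illustrated with a minimal sketch. This is not the actual DescriptorImpl code: the class RawSlice, its fields, and its methods are hypothetical names chosen for illustration, and the charset is passed in explicitly only to show where the conversion would be centralized.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical view onto part of a shared byte[]; no bytes are copied. */
final class RawSlice {

  final byte[] fileBytes; // reference to the entire file contents
  final int offset;       // start of this part within fileBytes
  final int length;       // number of bytes belonging to this part

  RawSlice(byte[] fileBytes, int offset, int length) {
    if (offset < 0 || length < 0 || offset + length > fileBytes.length) {
      throw new IndexOutOfBoundsException("offset=" + offset
          + ", length=" + length + ", array=" + fileBytes.length);
    }
    this.fileBytes = fileBytes;
    this.offset = offset;
    this.length = length;
  }

  /** Returns a part of this slice that still refers to the same array. */
  RawSlice subSlice(int relativeOffset, int subLength) {
    if (relativeOffset < 0 || subLength < 0
        || relativeOffset + subLength > this.length) {
      throw new IndexOutOfBoundsException("part exceeds parent slice");
    }
    return new RawSlice(this.fileBytes, this.offset + relativeOffset,
        subLength);
  }

  /** Single place where bytes are turned into a String. */
  String asString(Charset charset) {
    return new String(this.fileBytes, this.offset, this.length, charset);
  }
}

public class RawSliceSketch {
  public static void main(String[] args) {
    // Stand-in for a descriptor file containing two descriptors.
    byte[] file = "router a\nrouter b\n"
        .getBytes(StandardCharsets.US_ASCII);
    RawSlice second = new RawSlice(file, 9, 9);
    RawSlice keyword = second.subSlice(0, 6);
    System.out.print(second.asString(StandardCharsets.US_ASCII));
    System.out.println(keyword.asString(StandardCharsets.US_ASCII));
  }
}
```

Both slices share the one array read from the file, so the only memory cost per descriptor or descriptor part is a reference and two ints. Strings are created on demand in a single method, which is where a later charset switch would be made.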
src
.gitignore
.gitmodules
build.xml
CERT
CHANGELOG.md
CONTRIB.md
LICENSE
README.md
DescripTor -- A Tor Descriptor API for Java
DescripTor is a Java API that fetches Tor descriptors from a variety of sources like cached descriptors and directory authorities/mirrors. The DescripTor API is useful for statistical analysis of Tor network data and for building services and applications.
The descriptor types supported by DescripTor include relay and bridge descriptors which are part of Tor's directory protocol as well as Torperf data files and TorDNSEL's exit lists. Access to these descriptors is unified to facilitate access to publicly available data about the Tor network.
This API is designed for Java programs that process Tor descriptors in batches. A Java program using this API first sets up a descriptor source by defining where to find descriptors and which descriptors it considers relevant. The descriptor source then makes the descriptors available in a descriptor store. The program can then query the descriptor store for the contained descriptors. Changes to the descriptor sources after descriptors are made available in the descriptor store will not be noticed. This simple programming model was designed for periodically running, batch-processing applications and not for continuously running applications that rely on learning about changes to an underlying descriptor source.
The executable jar, source jar, and javadoc jar can be found in
generated/dist/
Before using them, please verify the release (see below for instructions).
Verifying releases
Releases can be cryptographically verified to get some more confidence that they were put together by a Tor developer. The following steps explain the verification process by example.
Download the release tarball and the separate signature file:
wget https://dist.torproject.org/descriptor/1.0.0/descriptor-1.0.0.tar.gz
wget https://dist.torproject.org/descriptor/1.0.0/descriptor-1.0.0.tar.gz.asc
Attempt to verify the signature on the tarball:
gpg --verify descriptor-1.0.0.tar.gz.asc
If the signature cannot be verified due to the public key of the signer not being locally available, download that public key from one of the key servers and retry:
gpg --keyserver pgp.mit.edu --recv-key 0x4EFD4FDC3F46D41E
gpg --verify descriptor-1.0.0.tar.gz.asc
If the signature still cannot be verified, something is wrong!
But note that even if it can be verified, you now only know that the signature was made by the person claiming to own this key, which could be anyone. You'll need a trust path to the owner of this key in order to trust this signature, but that's clearly out of scope here. In short, your best chance is to meet a Tor developer in real life and enter the web of trust.
If you want to go one step further in the verification game, you can verify the signature on the .jar files.
Print and then import the provided X.509 certificate:
keytool -printcert -file CERT
keytool -importcert -alias karsten -file CERT
Verify the signatures on the contained .jar files using Java's jarsigner tool:
jarsigner -verify descriptor-1.0.0.jar
jarsigner -verify descriptor-1.0.0-sources.jar
Tutorial
The Metrics website has a tutorial for getting started with metrics-lib: