unofficial git repo -- report bugs/issues/pull requests on https://gitlab.torproject.org/ --
Karsten Loesing a4d184bf94 Store raw descriptors as byte[], offset, and length.
Prior to this commit we read raw descriptor bytes from disk, split
them into several byte[] for each contained descriptor, and stored
those copies together with descriptors.  We further copied descriptor
parts, like signatures or status entries, and stored those copies as
well.

Overall, we temporarily required up to 3 times the size of descriptor
files just to store raw descriptor contents: 1) the entire descriptor
file read to memory, 2) copies of all contained descriptors, and 3)
copies of contained descriptor parts.  After moving on to the next
descriptor file, 1) was freed, but 2) and 3) remained in memory.  This
was rather wasteful.

With this commit we store raw descriptors as a reference to the byte[]
containing the entire descriptor file plus offset and length of the
part containing one descriptor.  Similarly we store raw descriptor
parts as a reference to the full descriptor plus offset and length of
the descriptor part.  This saves a lot of memory, and it avoids
unnecessary array copying.
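
As a minimal sketch of the reference-plus-offset-and-length idea described above (the class and method names here are hypothetical and are not taken from DescriptorImpl; for simplicity a part points straight at the shared array rather than at its parent descriptor):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch; not the actual DescriptorImpl code. */
final class RawSliceSketch {

  /** A view on part of a shared byte[], described by offset and length. */
  static final class Slice {
    final byte[] fileBytes;  // shared with all other slices, never copied
    final int offset;
    final int length;

    Slice(byte[] fileBytes, int offset, int length) {
      if (offset < 0 || length < 0 || offset + length > fileBytes.length) {
        throw new IllegalArgumentException("Slice exceeds array bounds.");
      }
      this.fileBytes = fileBytes;
      this.offset = offset;
      this.length = length;
    }

    /** Returns a narrower view; the offset is relative to this slice. */
    Slice part(int relativeOffset, int partLength) {
      if (relativeOffset < 0 || partLength < 0
          || relativeOffset + partLength > this.length) {
        throw new IllegalArgumentException("Part exceeds slice bounds.");
      }
      return new Slice(this.fileBytes, this.offset + relativeOffset,
          partLength);
    }

    /** Copies the viewed bytes, but only when a caller asks for them. */
    byte[] copyBytes() {
      return Arrays.copyOfRange(this.fileBytes, this.offset,
          this.offset + this.length);
    }
  }

  public static void main(String[] args) {
    byte[] file = "router a\nrouter b\n".getBytes(StandardCharsets.US_ASCII);
    Slice second = new Slice(file, 9, 9);   // the second descriptor
    Slice nickname = second.part(7, 1);     // a part of that descriptor
    System.out.println(
        new String(nickname.copyBytes(), StandardCharsets.US_ASCII));
  }
}
```

In this sketch, creating a slice or a part allocates only a small object; the file contents exist once in memory until a caller explicitly requests a copy.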

This change is also a step towards not storing raw descriptor contents
in memory at all, but instead leaving contents on disk and accessing
parts as needed.  However, this commit does not take that step yet.

The original purpose of this commit was to prepare switching from the
platform's default charset to UTF-8 for #21932.  The idea was to
reduce access to DescriptorImpl#rawDescriptorBytes and add all methods
working on those bytes, including converting them to a String, to
DescriptorImpl.  This commit achieves this purpose by preparing that
switch, yet it does not take that step, either.  Switching to UTF-8 is
mildly backward-incompatible, so it'll have to wait until 2.0.0.
However, switching will be much easier based on the changes in this
commit.

Many of the changes in this commit are interdependent, which makes it
difficult to split up this commit with reasonable effort.  Still, in
order to facilitate reviews, here is an explanation of changes made in
this commit from top to bottom:

Move all code for processing raw descriptor bytes, from a) detecting
the descriptor type and b) finding descriptor starts and ends up to c)
invoking the right DescriptorImpl subclass constructors, out of
DescriptorImpl and its subclasses and into DescriptorParserImpl.

Include offset and length in the constructors of DescriptorImpl and
most of its subclasses.

Refer to directory and network status parts in RelayDirectoryImpl and
NetworkStatusImpl and its subclasses by offset and length rather than
passing copies of raw descriptors.

Provide two overloaded methods DescriptorImpl#newScanner() that
internally handle the byte[]-to-String conversion rather than leaving
this task to all DescriptorImpl subclasses.
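
A rough sketch of what such a pair of overloads could look like follows. The static signatures below are assumptions made for illustration; the commit message does not spell out the real ones.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

/** Hypothetical sketch; signatures are assumed, not copied from the code. */
final class ScannerSketch {

  /** Returns a line-by-line scanner over the whole array. */
  static Scanner newScanner(byte[] bytes) {
    return newScanner(bytes, 0, bytes.length);
  }

  /** Returns a line-by-line scanner over bytes[offset, offset + length). */
  static Scanner newScanner(byte[] bytes, int offset, int length) {
    /* XXX21932: converts using the platform's default charset. */
    String text = new String(bytes, offset, length, Charset.defaultCharset());
    return new Scanner(text).useDelimiter("\n");
  }

  public static void main(String[] args) {
    byte[] file = "skip\nplatform Tor\nuptime 42\n"
        .getBytes(StandardCharsets.US_ASCII);
    Scanner scanner = newScanner(file, 5, file.length - 5);
    while (scanner.hasNext()) {
      System.out.println(scanner.next());
    }
  }
}
```

Keeping the conversion in one place means a later charset switch would touch a single line rather than every subclass.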

In DescriptorImpl, rather than storing a copy of raw descriptor bytes
per descriptor, store a reference to a potentially larger byte[],
containing all descriptors read from a given file, together with
offset and length.

Add various methods to DescriptorImpl that give access to raw
descriptor bytes and that internally handle issues like unified
character encoding.

Include an XXX21932 tag in all places where byte[] is currently
converted to String using the platform's default charset.

Update existing methods in DescriptorImpl to only access
rawDescriptorBytes within offset and offset + length.

In classes referenced from DescriptorImpl subclasses, like
DirSourceEntryImpl and NetworkStatusEntryImpl, rather than storing a
copy of raw descriptor bytes, store a reference to the parent
DescriptorImpl instance together with offset and length.

Change raw descriptor bytes in ExitListEntryImpl into a String,
because the byte[] we stored there was never read from disk but
generated by ourselves with String#getBytes() and the platform's
default charset.  We also never used raw bytes in ExitListEntryImpl
anyway.  Admittedly, we could use offset and length there, too, but
the amount of saved memory is likely not worth the necessary code
changes.

Remove redundant zero-length checks from DescriptorImpl subclasses
including ExitListImpl, NetworkStatusImpl, and RelayDirectoryImpl.
These checks are redundant, because we already performed the same
checks in DescriptorImpl#countKeys().

Move commonly used helper methods for finding the first index of a
keyword or splitting descriptors by keyword from DescriptorImpl
subclasses, like NetworkStatusImpl and RelayDirectoryImpl, to
DescriptorImpl.
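
To illustrate the kind of helpers meant here, the following sketch searches a byte range for a keyword and splits the range at each occurrence, returning offset/length pairs instead of copies. Method names, signatures, and the rule that a keyword only counts at the start of a line are assumptions for illustration, not the actual DescriptorImpl code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual DescriptorImpl helpers. */
final class KeywordSearchSketch {

  /** Returns whether key starts a line at index i within [offset, end). */
  private static boolean startsLineWithKey(byte[] bytes, int offset, int end,
      int i, byte[] key) {
    if (i != offset && bytes[i - 1] != '\n') {
      return false;  // not at the start of a line
    }
    if (i + key.length > end) {
      return false;  // key would run past the end of the range
    }
    for (int k = 0; k < key.length; k++) {
      if (bytes[i + k] != key[k]) {
        return false;
      }
    }
    return true;
  }

  /** Returns the first index where key starts a line, or -1 if none. */
  static int findFirstIndexOfKey(byte[] bytes, int offset, int length,
      String key) {
    byte[] keyBytes = key.getBytes(StandardCharsets.US_ASCII);
    int end = offset + length;
    for (int i = offset; i < end; i++) {
      if (startsLineWithKey(bytes, offset, end, i, keyBytes)) {
        return i;
      }
    }
    return -1;
  }

  /** Splits the range at each line starting with key; returns
   * {offset, length} pairs.  Bytes before the first key are skipped. */
  static List<int[]> splitByKey(byte[] bytes, int offset, int length,
      String key) {
    byte[] keyBytes = key.getBytes(StandardCharsets.US_ASCII);
    List<int[]> parts = new ArrayList<>();
    int end = offset + length;
    int start = -1;
    for (int i = offset; i < end; i++) {
      if (startsLineWithKey(bytes, offset, end, i, keyBytes)) {
        if (start >= 0) {
          parts.add(new int[] { start, i - start });
        }
        start = i;
      }
    }
    if (start >= 0) {
      parts.add(new int[] { start, end - start });
    }
    return parts;
  }
}
```

For example, calling splitByKey on a status-like range with the key "r " would yield one offset/length pair per entry, each of which can be handed on without copying any bytes.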

In test classes, replace the numerous invocations of DescriptorImpl
subclass constructors with local buildSomething() methods, so that
future changes to constructor signatures won't produce a diff as long
as this one.
2017-06-06 15:08:49 +02:00
src Store raw descriptors as byte[], offset, and length. 2017-06-06 15:08:49 +02:00
.gitignore Download using index.json; implements task-19791. 2016-08-24 21:52:38 +02:00
.gitmodules Implements task-20596: use metrics-base and reduced build.xml, 2017-01-05 15:35:32 +01:00
build.xml Bump version to 1.7.0-dev. 2017-05-17 14:00:41 +02:00
CERT Prepare for 1.7.0 release. 2017-05-16 16:53:49 +02:00
CHANGELOG.md Store raw descriptors as byte[], offset, and length. 2017-06-06 15:08:49 +02:00
CONTRIB.md Added development description. 2017-01-05 15:35:44 +01:00
LICENSE Update copyright. 2017-01-13 16:47:42 +01:00
README.md Add tutorial link and examples. 2017-03-13 20:18:21 +01:00

DescripTor -- A Tor Descriptor API for Java

DescripTor is a Java API that fetches Tor descriptors from a variety of sources like cached descriptors and directory authorities/mirrors. The DescripTor API is useful to support statistical analysis of the Tor network data and for building services and applications.

The descriptor types supported by DescripTor include relay and bridge descriptors which are part of Tor's directory protocol as well as Torperf data files and TorDNSEL's exit lists. Access to these descriptors is unified to facilitate access to publicly available data about the Tor network.

This API is designed for Java programs that process Tor descriptors in batches. A Java program using this API first sets up a descriptor source by defining where to find descriptors and which descriptors it considers relevant. The descriptor source then makes the descriptors available in a descriptor store. The program can then query the descriptor store for the contained descriptors. Changes to the descriptor sources after descriptors are made available in the descriptor store will not be noticed. This simple programming model was designed for periodically running, batch-processing applications and not for continuously running applications that rely on learning about changes to an underlying descriptor source.

The executable jar, source jar, and javadoc jar can be found in

generated/dist/

Before using them, please verify the release (see below for instructions).

Verifying releases

Releases can be cryptographically verified to get some more confidence that they were put together by a Tor developer. The following steps explain the verification process by example.

Download the release tarball and the separate signature file:

wget https://dist.torproject.org/descriptor/1.0.0/descriptor-1.0.0.tar.gz
wget https://dist.torproject.org/descriptor/1.0.0/descriptor-1.0.0.tar.gz.asc

Attempt to verify the signature on the tarball:

gpg --verify descriptor-1.0.0.tar.gz.asc

If the signature cannot be verified due to the public key of the signer not being locally available, download that public key from one of the key servers and retry:

gpg --keyserver pgp.mit.edu --recv-key 0x4EFD4FDC3F46D41E
gpg --verify descriptor-1.0.0.tar.gz.asc

If the signature still cannot be verified, something is wrong!

But note that even if it can be verified, you now only know that the signature was made by the person claiming to own this key, which could be anyone. You'll need a trust path to the owner of this key in order to trust this signature, but that's clearly out of scope here. In short, your best chance is to meet a Tor developer in real life and enter the web of trust.

If you want to go one step further in the verification game, you can verify the signature on the .jar files.

Print and then import the provided X.509 certificate:

keytool -printcert -file CERT
keytool -importcert -alias karsten -file CERT

Verify the signatures on the contained .jar files using Java's jarsigner tool:

jarsigner -verify descriptor-1.0.0.jar
jarsigner -verify descriptor-1.0.0-sources.jar

Tutorial

The Metrics website has a tutorial for getting started with metrics-lib:

https://metrics.torproject.org/metrics-lib.html