collector/doc/manual.tex

\documentclass{article}
\begin{document}
\title{ERNIE: a tool to study the Tor network\\-- User's Guide --}
\author{by Karsten Loesing \texttt{<karsten@torproject.org>}}
\maketitle

\section{Overview}

Welcome to ERNIE!
ERNIE is a tool to study the Tor network.
ERNIE has been designed to process all kinds of data about the Tor network
and visualize them or prepare them for further analysis.
ERNIE is also the software behind the Tor Metrics Portal
\verb+http://metrics.torproject.org/+.

The acronym ERNIE stands for the \emph{Enhanced R-based tor Network
Intelligence Engine} (sorry for misspelling Tor).
Why ERNIE?
Because nobody liked BIRT (Business Intelligence and Reporting Tools) that
we used for visualizing statistics about the Tor network before writing
our own software.
By the way, reasons were that BIRT made certain people's browsers crash
and requires JavaScript that most Tor user have turned off.

If you want to learn more about the Tor network, regardless of whether you
want to present your findings on a website (like ERNIE does) or include
them in your next Tor paper, this user's guide is for you!

\section{Getting started with ERNIE}

The ERNIE project was started as a simple tool to parse Tor relay
descriptors and plot graphs on Tor network usage for a website.
Since then, ERNIE has grown to a tool that can process all kinds of Tor
network data for various purposes, including but not limited to
visualization.

We think that the easiest way to get started with ERNIE is to walk through
typical use cases in a tutorial style and explain what is required to set
up ERNIE.
These use cases have been chosen from what we think are typical
applications of ERNIE.

\subsection{Visualizing network statistics}

{\it Write me.}

\subsection{Importing relay descriptors into a database}

As of February 2010, the relays and directories in the Tor network
generate more than 1 GB of descriptors every month.
There are two approaches to process these amounts of data:
extract only the relevant data for the analysis and write them to files,
or import all data to a database and run queries on the database.
ERNIE currently takes the file-based approach for the Metrics Portal,
which works great for standardized analyses.
But the more flexible way to research the Tor network is to work with a
database.

This tutorial describes how to import relay descriptors into a database
and run a few example queries.
Note that the presented database schema is limited to answering basic
questions about the Tor network.
In order to answer more complex questions, one would have to extend the
database schema and Java classes which is sketched at the end of this
tutorial.

\subsubsection{Preparing database for data import}

The first step in importing relay descriptors into a database is to
install a database management system.
We won't go into the details of installing a database for the various
operating systems in this tutorial.
Please consult the tutorials and manuals that are out on the Web.
For this tutorial, we assume that you have PostgreSQL 8.4 installed.
Note that in theory, any other relational database that has a working JDBC
4 driver should work, too, possibly with minor modifications to ERNIE.
We further assume a database user called \verb+ernie+ that is allowed to
define, modify, and query database objects.

First, create a new database schema \verb+tordir+ with two tables that we
need for importing relay descriptors, plus two indexes to accelerate
queries. Note that \verb+$+ denotes a shell prompt and \verb+tordir=>+ the
database prompt.

\begin{verbatim}
$ createdb -U ernie -O ernie tordir
$ psql -U ernie tordir
tordir=> CREATE TABLE statusentry (
  validafter TIMESTAMP NOT NULL,
  descriptor CHAR(40) NOT NULL,
  isauthority BOOLEAN NOT NULL DEFAULT false,
  isbadexit BOOLEAN NOT NULL DEFAULT false,
  isbaddirectory BOOLEAN NOT NULL DEFAULT false,
  isexit BOOLEAN NOT NULL DEFAULT false,
  isfast BOOLEAN NOT NULL DEFAULT false,
  isguard BOOLEAN NOT NULL DEFAULT false,
  ishsdir BOOLEAN NOT NULL DEFAULT false,
  isnamed BOOLEAN NOT NULL DEFAULT false,
  isstable BOOLEAN NOT NULL DEFAULT false,
  isrunning BOOLEAN NOT NULL DEFAULT false,
  isunnamed BOOLEAN NOT NULL DEFAULT false,
  isvalid BOOLEAN NOT NULL DEFAULT false,
  isv2dir BOOLEAN NOT NULL DEFAULT false,
  isv3dir BOOLEAN NOT NULL DEFAULT false,
  PRIMARY KEY (validafter, descriptor));
tordir=> CREATE TABLE descriptor (
  descriptor CHAR(40) NOT NULL PRIMARY KEY,
  address VARCHAR(15) NOT NULL,
  orport INTEGER NOT NULL,
  dirport INTEGER NOT NULL,
  bandwidthavg BIGINT NOT NULL,
  bandwidthburst BIGINT NOT NULL,
  bandwidthobserved BIGINT NOT NULL,
  platform VARCHAR(256),
  published TIMESTAMP NOT NULL,
  uptime BIGINT);
tordir=> CREATE INDEX statusvalidafter
  ON statusentry (validafter);
tordir=> CREATE INDEX descriptorid
  ON descriptor (descriptor);
tordir=> \q
\end{verbatim}

A row in the \verb+statusentry+ table contains the information that a
given relay (that has published the server descriptor with ID
\verb+descriptor+) was contained in the network status consensus published
at time \verb+validafter+.
These two fields uniquely identify a row in the \verb+statusentry+ table.
The other fields contain boolean values for the flags that the directory
authorities assigned to the relay in this consensus, e.g., the Exit flag
in \verb+isexit+.
Note that for the 24 network status consensuses of a given day with each
of them containing 2000 relays, there will be $24 \times 2000$ rows in the
\verb+statusentry+ table.

The \verb+descriptor+ table contains some portion of the information that
a relay includes in its server descriptor.
Descriptors are identified by the \verb+descriptor+ field which
corresponds to the \verb+descriptor+ field in the \verb+statusentry+
table.
The other fields contain further data of the server descriptor that might
be relevant for analyses, e.g., the platform line with the Tor software
version and operating system of the relay.

Obviously, this data schema doesn't match everyone's needs.
See the instructions below for extending ERNIE to import other data into
the database.

\subsubsection{Downloading relay descriptors from the metrics website}

In the next step you will probably want to download relay descriptors from
the metrics website
\verb+http://metrics.torproject.org/data.html#relaydesc+.
Download the \verb+v3 consensuses+ and/or \verb+server descriptors+ of the
months you want to analyze.
The server descriptors are the documents that relays publish at least
every 18 hours describing their capabilities, whereas the v3 consensuses
are views of the directory authorities on the available relays at a given
time.
For this tutorial you need both v3 consensuses and server descriptors.
You might want to start with a single month of data, experiment with it,
and import more data later on.
Extract the tarballs to a new directory \verb+archives/+ in the ERNIE
working directory.

\subsubsection{Configuring ERNIE to import relay descriptors into a
database}

ERNIE can be used to read data from one or more data sources and write
them to one or more data sinks.
You need to configure ERNIE so that it knows to use the downloaded relay
descriptors as data source and the database as data sink.
You have implicitly accomplished the former by creating the
\verb+archives/+ directory.
By default, ERNIE looks for this directory and tries to import everything
contained in it.
You could change this behavior by explicitly telling ERNIE not to import
data from the \verb+archives/+ directory by adding a line
\verb+ImportDirectoryArchives 0+ to the config file, but this is not what
we want in this tutorial.
But you need to explicitly enable your database as a data sink.
Add the following line to your \verb+config+ file:

\begin{verbatim}
WriteRelayDescriptorDatabase 1
\end{verbatim}

You further need to provide the JDBC string that ERNIE shall use to access
the database schema \verb+tordir+ that we created above.
The config option with the JDBC string for a local PostgreSQL database
might be (without line break):

\begin{verbatim}
RelayDescriptorDatabaseJDBC
  jdbc:postgresql:tordir?user=ernie&password=password
\end{verbatim}

\subsubsection{Importing relay descriptors using ERNIE}

Now you are ready to actually import relay descriptors using ERNIE.
Compile the Java classes and run ERNIE.

\begin{verbatim}
$ ./download.sh
$ ./run.sh
\end{verbatim}

Note that the import process might take between a few minutes and an hour,
depending on your hardware.
You will notice that ERNIE doesn't progress messages to the standard
output.
You can either change this behavior by setting
\verb+java.util.logging.ConsoleHandler.level+ in
\verb+logging.properties+ to \verb+INFO+ or \verb+FINE+.
Alternately, you can look at the log file \verb+log.0+ that is created by
ERNIE.

If ERNIE finishes after a few seconds, you have probably put the relay
descriptors at the wrong place.
Make sure that you extract the relay descriptors to sub directories of
\verb+archives/+ in the ERNIE working directory.

If you interrupt ERNIE, or if ERNIE terminates uncleanly for some reason,
you will have problems starting it the next time.
ERNIE uses a local lock file called \verb+lock+ to make sure that only a
single instance of ERNIE is running at a time.
If you are sure that the last ERNIE instance isn't running anymore, you
can remove the lock file and start ERNIE again.

If all goes well, you should now have the relay descriptors of 1 month in
your database.

\subsubsection{Example queries}

In this tutorial, we want to give you a few examples for using the
database schema with the imported relay descriptors to extract some useful
statistics about the Tor network.

In the first example we want to find out how many relays have been running
on average per day and how many of these relays were exit relays.
We only need the \verb+statusentry+ table for this evaluation, because
the information we are interested in is contained in the network status
consensuses.

The SQL statement that we need for this evaluation consists of two parts:
First, we find out how many network status consensuses have been published
on any given day.
Second, we count all relays and those with the Exit flag and divide these
numbers by the number of network status consensuses per day.

\begin{verbatim}
$ psql -U ernie tordir
tordir=> SELECT DATE(validafter),
    COUNT(*) / relay_statuses_per_day.count AS avg_running,
    SUM(CASE WHEN isexit IS TRUE THEN 1 ELSE 0 END) /
      relay_statuses_per_day.count AS avg_exit
  FROM statusentry,
    (SELECT COUNT(*) AS count, DATE(validafter) AS date
      FROM (SELECT DISTINCT validafter FROM statusentry)
      distinct_consensuses
      GROUP BY DATE(validafter)) relay_statuses_per_day
  WHERE DATE(validafter) = relay_statuses_per_day.date
  GROUP BY DATE(validafter), relay_statuses_per_day.count
  ORDER BY DATE(validafter);
tordir=> \q
\end{verbatim}

Executing this query should finish within a few seconds to one minute,
again depending on your hardware.
The result might start like this (truncated here):

\begin{verbatim}
    date    | avg_running | avg_exit
------------+-------------+----------
 2010-02-01 |        1583 |      627
 2010-02-02 |        1596 |      638
 2010-02-03 |        1600 |      654
:
\end{verbatim}

In the second example we want to find out what Tor software versions the
relays have been running.
More precisely, we want to know how many relays have been running what Tor
versions on micro version granularity (e.g., 0.2.2) on average per day?

We need to combine network status consensuses with server descriptors to
find out this information, because the version information is not
contained in the consensuses (or at least, it's optional to be contained
in there; and after all, this is just an example).
Note that we cannot focus on server descriptors only and leave out the
consensuses for this analysis, because we want our analysis to be limited
to running relays as confirmed by the directory authorities and not
include all descriptors that happened to be published at a given day.

The SQL statement again determines the number of consensuses per day in a
sub query.
In the next step, we join the \verb+statusentry+ table with the
\verb+descriptor+ table for all rows contained in the \verb+statusentry+
table.
The left join means that we include \verb+statusentry+ rows even if we do
not have corresponding lines in the \verb+descriptor+ table.
We determine the version by skipping the first 4 characters of the platform
string that should contain \verb+"Tor "+ (without quotes) and cutting off
after another 5 characters.
Obviously, this approach is prone to errors if the platform line format
changes, but it should be sufficient for this example.

\begin{verbatim}
$ psql -U ernie tordir
tordir=> SELECT DATE(validafter) AS date,
    SUBSTRING(platform, 5, 5) AS version,
    COUNT(*) / relay_statuses_per_day.count AS count
  FROM
    (SELECT COUNT(*) AS count, DATE(validafter) AS date
    FROM (SELECT DISTINCT validafter
      FROM statusentry) distinct_consensuses
    GROUP BY DATE(validafter)) relay_statuses_per_day
  JOIN statusentry
    ON relay_statuses_per_day.date = DATE(validafter)
  LEFT JOIN descriptor
    ON statusentry.descriptor = descriptor.descriptor
  GROUP BY DATE(validafter), SUBSTRING(platform, 5, 5),
    relay_statuses_per_day.count, relay_statuses_per_day.date
  ORDER BY DATE(validafter), SUBSTRING(platform, 5, 5);
tordir=> \q
\end{verbatim}

Running this query takes longer than the first query, which can be a few
minutes to half an hour.
The main reason is that joining the two tables is an expensive database
operation.
If you plan to perform many evaluations like this one, you might want to
create a third table that holds the results of joining the two tables of
this tutorial.
Creating such a table to speed up queries is not specific to ERNIE and
beyond the scope of this tutorial.

The (truncated) result of the query might look like this:

\begin{verbatim}
    date    | version | count
------------+---------+-------
 2010-02-01 | 0.1.2   |    10
 2010-02-01 | 0.2.0   |   217
 2010-02-01 | 0.2.1   |   774
 2010-02-01 | 0.2.2   |    75
 2010-02-01 |         |   505
 2010-02-02 | 0.1.2   |    14
 2010-02-02 | 0.2.0   |   328
 2010-02-02 | 0.2.1   |  1143
 2010-02-02 | 0.2.2   |   110
:
\end{verbatim}

Note that, in the fifth line, we are missing the server descriptors of 505
relays contained in network status consensuses published on 2010-02-01.
If you want to avoid such missing values, you'll have to import the server
descriptors of the previous month, too.

\subsubsection{Extending ERNIE to import further data into the database}

In this tutorial we have explained how to prepare a database, download
relay descriptors, configure ERNIE, import the descriptors, and execute
example queries.
This description is limited to a few examples by the very nature of a
tutorial.
If you want to extend ERNIE to import further data into your database,
you will have to perform at least two steps:
extend the database schema and modify the Java classes used for parsing.

The first step, extending the database schema, is not specific to ERNIE.
Just add the fields and tables to the schema definition.

The second step, modifying the Java classes used for parsing, is of course
specific to ERNIE.
You will have to look at two classes in particular:
The first class, \verb+RelayDescriptorDatabaseImporter+, contains the
prepared statements and methods used to add network status consensus
entries and server descriptors to the database.
The second class, \verb+RelayDescriptorParser+, contains the parsing logic
for the relay descriptors and decides what information to add to the
database, among other things.

This ends the tutorial on importing relay descriptors into a database.
Happy researching!

\subsection{Aggregating relay and bridge descriptors}

{\it Write me.}

\section{Software architecture}

{\it Write me. In particular, include overview of components:

\begin{itemize}
\item Data sources and data sinks
\item Java classes with data sources and data sinks
\item R scripts to process CSV output
\item Website
\end{itemize}
}

\section{Tor Metrics Portal setup}

{\it
Write me. In particular, include documentation of deployed ERNIE that
runs the metrics website.
This documentation has two purposes:
First, a reference setup can help others creating their own ERNIE
configuration that goes beyond the use cases as described above.
Second, we need to remember how things are configured anyway, so we can
as well document them here.}

\end{document}