Rewrite documentation.

This commit is contained in:
Karsten Loesing 2015-05-16 11:40:25 +02:00
parent eab1ebc85a
commit 94c7158834
5 changed files with 80 additions and 557 deletions

2
.gitignore vendored
View File

@ -5,8 +5,6 @@ config
data/
in/
out/
doc/manual.aux
doc/manual.log
log/
rsync/
stats/

80
INSTALL.md Normal file
View File

@ -0,0 +1,80 @@
CollecTor -- Operator's Guide
=============================
Welcome to the Operator's Guide of CollecTor. This guide explains how
to set up a new CollecTor instance to download relay descriptors from the
Tor directory authorities.
Requirements
------------
You'll need a Linux host with at least 50G disk space and 2G RAM.
In the following we'll assume that the host runs Debian stable as
operating system, but it should work on any other Linux or possibly even
*BSD. Though you'll be mostly on your own with those.
Prepare the system
------------------
Create a working directory for CollecTor. In this guide, we'll assume
that you're using `/srv/collector.torproject.org/` as working directory,
but feel free to use another directory that better suits your needs.
$ sudo mkdir -p /srv/collector.torproject.org/
$ sudo chown vagrant:vagrant /srv/collector.torproject.org/
Install a few packages:
$ sudo apt-get install openjdk-6-jdk ant libcommons-codec-java \
libcommons-compress-java
Clone the metrics-db repository
-------------------------------
$ cd /srv/collector.torproject.org/
$ git clone https://git.torproject.org/metrics-db
Clone required submodule metrics-lib
------------------------------------
$ git submodule init
$ git submodule update
Compile CollecTor
-----------------
$ ant compile
Configure the relay descriptor downloader
-----------------------------------------
Edit the config file and uncomment and edit at least the following line:
DownloadRelayDescriptors 1
Run the relay descriptor downloader
-----------------------------------
$ bin/run-relaydescs
Set up an hourly cronjob for the relay descriptor downloader
------------------------------------------------------------
Ideally, run the relay descriptor downloader once per hour by adding a
crontab entry like the following:
6 * * * * cd /srv/collector.torproject.org/db/ && bin/run-relaydescs
Watch out for INFO-level logs in the `log/` directory. In particular, the
lines following "Statistics on the completeness of written relay
descriptors:" is quite important.

7
README
View File

@ -1,7 +0,0 @@
ERNIE is the Enhanced R-based tor Network Intelligence Engine
(why ERNIE? because nobody liked BIRT; sorry for misspelling Tor)
--------------------------------------------------------------------------
Please find documentation in doc/ .

Binary file not shown.

View File

@ -1,548 +0,0 @@
\documentclass{article}
\begin{document}
\title{ERNIE: a tool to study the Tor network\\-- User's Guide --}
\author{by Karsten Loesing \texttt{<karsten@torproject.org>}}
\maketitle
\section{Overview}
Welcome to ERNIE!
ERNIE is a tool to study the Tor network.
ERNIE has been designed to process all kinds of data about the Tor network
and visualize them or prepare them for further analysis.
ERNIE is also the software behind the Tor Metrics Portal
\verb+http://metrics.torproject.org/+.
The acronym ERNIE stands for the \emph{Enhanced R-based tor Network
Intelligence Engine} (sorry for misspelling Tor).
Why ERNIE?
Because nobody liked BIRT (Business Intelligence and Reporting Tools) that
we used for visualizing statistics about the Tor network before writing
our own software.
By the way, reasons were that BIRT made certain people's browsers crash
and requires JavaScript that most Tor user have turned off.
If you want to learn more about the Tor network, regardless of whether you
want to present your findings on a website (like ERNIE does) or include
them in your next Tor paper, this user's guide is for you!
\section{Installation instructions}
ERNIE depends on various other software tools. ERNIE is developed in a
\emph{Git} repository which is currently the only way to download it.
ERNIE uses \emph{Java} for parsing data, \emph{R} for plotting graphs,
and \emph{PostgreSQL} for importing data into a database.
Which of these tools you need depends on what tasks you are planning to
use ERNIE for.
In most cases it is not required to install all these tools.
For this tutorial, we assume Debian GNU/Linux 5.0 as operating system.
Installation instructions for other platforms may vary.
\subsection{Git 1.5.6.5}
Currently, the only way to download ERNIE is to clone its Git branch.
Install Git 1.5.6.5 (or higher) and check that it's working:
\begin{verbatim}
$ sudo apt-get install git-core
$ git --version
\end{verbatim}
\subsection{Java 6}
ERNIE requires Java to parse data from various data sources and write them
to one or more data sinks. Java is required for most use cases of ERNIE.
Add the non-free repository to the apt sources in
\verb+/etc/apt/sources.list+ by changing the line (mirror URL may vary):
\begin{verbatim}
deb http://ftp.ca.debian.org/debian/ lenny main
\end{verbatim}
to
\begin{verbatim}
deb http://ftp.ca.debian.org/debian/ lenny main non-free
\end{verbatim}
Fetch the package list, install Sun Java 6, and set it as system default:
\begin{verbatim}
$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk
$ sudo update-alternatives --set java \
/usr/lib/jvm/java-6-sun/jre/bin/java
$ sudo update-alternatives --set javac \
/usr/lib/jvm/java-6-sun/bin/javac
\end{verbatim}
Check that Java 6 is installed and selected as default:
\begin{verbatim}
$ java -version
$ javac -version
\end{verbatim}
\subsection{Ant 1.7}
ERNIE comes with an Ant build file that facilitates common build tasks.
If you want to use Ant to build and run ERNIE, install Ant and check its
installed version (tested with 1.7):
\begin{verbatim}
$ sudo apt-get install ant
$ ant -version
\end{verbatim}
\subsection{R 2.8 and ggplot2}
ERNIE uses R and the R library \emph{ggplot2} to visualize anaylsis
results for presentation on a website or for inclusion in publications.
ggplot2 requires at least R version 2.8 to be installed.
Add a new line to \verb+/etc/apt/sources.list+:
\begin{verbatim}
deb http://cran.cnr.berkeley.edu/bin/linux/debian lenny-cran/
\end{verbatim}
Download the package maintainer's public key (``Johannes Ranke (CRAN Debian
archive) $<$jranke@uni-bremen.de$>$''):
\begin{verbatim}
$ gpg --keyserver pgpkeys.pca.dfn.de --recv-key 381BA480
$ gpg --export 381BA480 | sudo apt-key add -
\end{verbatim}
Install the most recent R version:
\begin{verbatim}
$ sudo apt-get update
$ sudo apt-get -t unstable install r-base
\end{verbatim}
Start R to check its version (must be 2.8 or higher) and install ggplot2.
Do this as root, so that the installed package is available to all system
users:
\begin{verbatim}
$ sudo R
> install.packages("ggplot2")
> q()
\end{verbatim}
Confirm that R and ggplot2 are installed:
\begin{verbatim}
$ R
> library(ggplot2)
> q()
\end{verbatim}
\subsection{PostgreSQL 8.3}
\label{sec-install-postgres}
ERNIE uses PostgreSQL to import data into a database for later analysis.
This feature is not required for most use cases of ERNIE, but only for
people who prefer having the network data in a database to execute custom
queries.
Install PostgreSQL 8.3 using apt-get:
\begin{verbatim}
$ sudo apt-get install postgresql-8.3
\end{verbatim}
Create a new database user \verb+ernie+ to insert data and run queries.
This command is executed as unix user \verb+postgres+ and therefore as
database superuser \verb+postgres+ via ident authentication. The
\verb+-P+ flag issues a password prompt for the new user.
There is no need to give the new user superuser privileges or allow it to
create databases or new roles.
\begin{verbatim}
$ sudo -u postgres createuser -P ernie
\end{verbatim}
Create a new database schema \verb+tordir+ owned by user \verb+ernie+
(using option \verb+-O+).
Again, this command is executed as \verb+postgres+ system user to make use
of ident authentication.
\begin{verbatim}
$ sudo -u postgres createdb -O ernie tordir
\end{verbatim}
Log into the database schema as user \verb+ernie+ to check that it's
working.
This time, ident authentication is not available, since there is no system
user \verb+ernie+.
Instead, we will use password authentication via a TCP connection to
localhost (using option \verb+-h+) as database user \verb+ernie+ (using
option \verb+-U+).
\begin{verbatim}
$ psql -h localhost -U ernie tordir
tordir=> \q
\end{verbatim}
\subsection{ERNIE}
Finally, you can install ERNIE by cloning its Git branch:
\begin{verbatim}
$ git clone git://git.torproject.org/ernie
\end{verbatim}
This command should create a directory \verb+ernie/+ which we will
consider the working directory of ERNIE.
\section{Getting started with ERNIE}
The ERNIE project was started as a simple tool to parse Tor relay
descriptors and plot graphs on Tor network usage for a website.
Since then, ERNIE has grown to a tool that can process all kinds of Tor
network data for various purposes, including but not limited to
visualization.
We think that the easiest way to get started with ERNIE is to walk through
typical use cases in a tutorial style and explain what is required to set
up ERNIE.
These use cases have been chosen from what we think are typical
applications of ERNIE.
\subsection{Visualizing network statistics}
{\it Write me.}
\subsection{Importing relay descriptors into a database}
As of February 2010, the relays and directories in the Tor network
generate more than 1 GB of descriptors every month.
There are two approaches to process these amounts of data:
extract only the relevant data for the analysis and write them to files,
or import all data to a database and run queries on the database.
ERNIE currently takes the file-based approach for the Metrics Portal,
which works great for standardized analyses.
But the more flexible way to research the Tor network is to work with a
database.
This tutorial describes how to import relay descriptors into a database
and run a few example queries.
Note that the presented database schema is limited to answering basic
questions about the Tor network.
In order to answer more complex questions, one would have to extend the
database schema and Java classes which is sketched at the end of this
tutorial.
\subsubsection{Preparing database for data import}
The first step in importing relay descriptors into a database is to
install a database management system.
See Section \ref{sec-install-postgres} for installation instructions of
PostgreSQL 8.3 on Debian GNU/Linux 5.0.
Note that in theory, any other relational database that has a working JDBC
4 driver should work, too, possibly with minor modifications to ERNIE.
Import the database schema from file \verb+db/tordir.sql+ containing two
tables that we need for importing relay descriptors plus two indexes to
accelerate queries. Check that tables have been created using \verb+\dt+.
You should see a list containing the two tables \verb+descriptor+ and
\verb+statusentry+.
\begin{verbatim}
$ psql -h localhost -U ernie -f db/tordir.sql tordir
$ psql -h localhost -U ernie tordir
tordir=> \dt
tordir=> \q
\end{verbatim}
A row in the \verb+statusentry+ table contains the information that a
given relay (that has published the server descriptor with ID
\verb+descriptor+) was contained in the network status consensus published
at time \verb+validafter+.
These two fields uniquely identify a row in the \verb+statusentry+ table.
The other fields contain boolean values for the flags that the directory
authorities assigned to the relay in this consensus, e.g., the Exit flag
in \verb+isexit+.
Note that for the 24 network status consensuses of a given day with each
of them containing 2000 relays, there will be $24 \times 2000$ rows in the
\verb+statusentry+ table.
The \verb+descriptor+ table contains some portion of the information that
a relay includes in its server descriptor.
Descriptors are identified by the \verb+descriptor+ field which
corresponds to the \verb+descriptor+ field in the \verb+statusentry+
table.
The other fields contain further data of the server descriptor that might
be relevant for analyses, e.g., the platform line with the Tor software
version and operating system of the relay.
Obviously, this data schema doesn't match everyone's needs.
See the instructions below for extending ERNIE to import other data into
the database.
\subsubsection{Downloading relay descriptors from the metrics website}
In the next step you will probably want to download relay descriptors from
the metrics website
\verb+http://metrics.torproject.org/data.html#relaydesc+.
Download the \verb+v3 consensuses+ and/or \verb+server descriptors+ of the
months you want to analyze.
The server descriptors are the documents that relays publish at least
every 18 hours describing their capabilities, whereas the v3 consensuses
are views of the directory authorities on the available relays at a given
time.
For this tutorial you need both v3 consensuses and server descriptors.
You might want to start with a single month of data, experiment with it,
and import more data later on.
Extract the tarballs to a new directory \verb+archives/+ in the ERNIE
working directory.
\subsubsection{Configuring ERNIE to import relay descriptors into a
database}
ERNIE can be used to read data from one or more data sources and write
them to one or more data sinks.
You need to configure ERNIE so that it knows to use the downloaded relay
descriptors as data source and the database as data sink.
Add the following two lines to your \verb+config+ file:
\begin{verbatim}
ImportDirectoryArchives 1
WriteRelayDescriptorDatabase 1
\end{verbatim}
You further need to provide the JDBC string that ERNIE shall use to access
the database schema \verb+tordir+ that we created above.
The config option with the JDBC string for a local PostgreSQL database
might be (without line break):
\begin{verbatim}
RelayDescriptorDatabaseJDBC
jdbc:postgresql://localhost/tordir?user=ernie&password=password
\end{verbatim}
\subsubsection{Importing relay descriptors using ERNIE}
Now you are ready to actually import relay descriptors using ERNIE.
Create a directory for Java class files, compile the Java source files,
and run ERNIE. All these steps are performed by the default target in the
provided Ant task.
\begin{verbatim}
$ ant
\end{verbatim}
Note that the import process might take between a few minutes and an hour,
depending on your hardware.
You will notice that ERNIE doesn't write progress messages to the standard
output, which is useful for unattended installations with only warnings
being mailed out by cron.
You can change this behavior and make messages on the standard output more
verbose by setting
\verb+java.util.logging.ConsoleHandler.level+ in
\verb+logging.properties+ to \verb+INFO+ or \verb+FINE+.
Alternately, you can look at the log file \verb+log.0+ that is created by
ERNIE.
If ERNIE finishes after a few seconds, you have probably put the relay
descriptors at the wrong place.
Make sure that you extract the relay descriptors to sub directories of
\verb+archives/+ in the ERNIE working directory.
If you interrupt ERNIE, or if ERNIE terminates uncleanly for some reason,
you will have problems starting it the next time.
ERNIE uses a local lock file called \verb+lock+ to make sure that only a
single instance of ERNIE is running at a time.
If you are sure that the last ERNIE instance isn't running anymore, you
can delete the lock file and start ERNIE again.
If all goes well, you should now have the relay descriptors of 1 month in
your database.
\subsubsection{Example queries}
In this tutorial, we want to give you a few examples for using the
database schema with the imported relay descriptors to extract some useful
statistics about the Tor network.
In the first example we want to find out how many relays have been running
on average per day and how many of these relays were exit relays.
We only need the \verb+statusentry+ table for this evaluation, because
the information we are interested in is contained in the network status
consensuses.
The SQL statement that we need for this evaluation consists of two parts:
First, we find out how many network status consensuses have been published
on any given day.
Second, we count all relays and those with the Exit flag and divide these
numbers by the number of network status consensuses per day.
\begin{verbatim}
$ psql -h localhost -U ernie tordir
tordir=> SELECT DATE(validafter),
COUNT(*) / relay_statuses_per_day.count AS avg_running,
SUM(CASE WHEN isexit IS TRUE THEN 1 ELSE 0 END) /
relay_statuses_per_day.count AS avg_exit
FROM statusentry,
(SELECT COUNT(*) AS count, DATE(validafter) AS date
FROM (SELECT DISTINCT validafter FROM statusentry)
distinct_consensuses
GROUP BY DATE(validafter)) relay_statuses_per_day
WHERE DATE(validafter) = relay_statuses_per_day.date
GROUP BY DATE(validafter), relay_statuses_per_day.count
ORDER BY DATE(validafter);
tordir=> \q
\end{verbatim}
Executing this query should finish within a few seconds to one minute,
again depending on your hardware.
The result might start like this (truncated here):
\begin{verbatim}
date | avg_running | avg_exit
------------+-------------+----------
2010-02-01 | 1583 | 627
2010-02-02 | 1596 | 638
2010-02-03 | 1600 | 654
:
\end{verbatim}
In the second example we want to find out what Tor software versions the
relays have been running.
More precisely, we want to know how many relays have been running what Tor
versions on micro version granularity (e.g., 0.2.2) on average per day?
We need to combine network status consensuses with server descriptors to
find out this information, because the version information is not
contained in the consensuses (or at least, it's optional to be contained
in there; and after all, this is just an example).
Note that we cannot focus on server descriptors only and leave out the
consensuses for this analysis, because we want our analysis to be limited
to running relays as confirmed by the directory authorities and not
include all descriptors that happened to be published at a given day.
The SQL statement again determines the number of consensuses per day in a
sub query.
In the next step, we join the \verb+statusentry+ table with the
\verb+descriptor+ table for all rows contained in the \verb+statusentry+
table.
The left join means that we include \verb+statusentry+ rows even if we do
not have corresponding lines in the \verb+descriptor+ table.
We determine the version by skipping the first 4 characters of the platform
string that should contain \verb+"Tor "+ (without quotes) and cutting off
after another 5 characters.
Obviously, this approach is prone to errors if the platform line format
changes, but it should be sufficient for this example.
\begin{verbatim}
$ psql -h localhost -U ernie tordir
tordir=> SELECT DATE(validafter) AS date,
SUBSTRING(platform, 5, 5) AS version,
COUNT(*) / relay_statuses_per_day.count AS count
FROM
(SELECT COUNT(*) AS count, DATE(validafter) AS date
FROM (SELECT DISTINCT validafter
FROM statusentry) distinct_consensuses
GROUP BY DATE(validafter)) relay_statuses_per_day
JOIN statusentry
ON relay_statuses_per_day.date = DATE(validafter)
LEFT JOIN descriptor
ON statusentry.descriptor = descriptor.descriptor
GROUP BY DATE(validafter), SUBSTRING(platform, 5, 5),
relay_statuses_per_day.count, relay_statuses_per_day.date
ORDER BY DATE(validafter), SUBSTRING(platform, 5, 5);
tordir=> \q
\end{verbatim}
Running this query takes longer than the first query, which can be a few
minutes to half an hour.
The main reason is that joining the two tables is an expensive database
operation.
If you plan to perform many evaluations like this one, you might want to
create a third table that holds the results of joining the two tables of
this tutorial.
Creating such a table to speed up queries is not specific to ERNIE and
beyond the scope of this tutorial.
The (truncated) result of the query might look like this:
\begin{verbatim}
date | version | count
------------+---------+-------
2010-02-01 | 0.1.2 | 10
2010-02-01 | 0.2.0 | 217
2010-02-01 | 0.2.1 | 774
2010-02-01 | 0.2.2 | 75
2010-02-01 | | 505
2010-02-02 | 0.1.2 | 14
2010-02-02 | 0.2.0 | 328
2010-02-02 | 0.2.1 | 1143
2010-02-02 | 0.2.2 | 110
:
\end{verbatim}
Note that, in the fifth line, we are missing the server descriptors of 505
relays contained in network status consensuses published on 2010-02-01.
If you want to avoid such missing values, you'll have to import the server
descriptors of the previous month, too.
\subsubsection{Extending ERNIE to import further data into the database}
In this tutorial we have explained how to prepare a database, download
relay descriptors, configure ERNIE, import the descriptors, and execute
example queries.
This description is limited to a few examples by the very nature of a
tutorial.
If you want to extend ERNIE to import further data into your database,
you will have to perform at least two steps:
extend the database schema and modify the Java classes used for parsing.
The first step, extending the database schema, is not specific to ERNIE.
Just add the fields and tables to the schema definition.
The second step, modifying the Java classes used for parsing, is of course
specific to ERNIE.
You will have to look at two classes in particular:
The first class, \verb+RelayDescriptorDatabaseImporter+, contains the
prepared statements and methods used to add network status consensus
entries and server descriptors to the database.
The second class, \verb+RelayDescriptorParser+, contains the parsing logic
for the relay descriptors and decides what information to add to the
database, among other things.
This ends the tutorial on importing relay descriptors into a database.
Happy researching!
\subsection{Aggregating relay and bridge descriptors}
{\it Write me.}
\section{Software architecture}
{\it Write me. In particular, include overview of components:
\begin{itemize}
\item Data sources and data sinks
\item Java classes with data sources and data sinks
\item R scripts to process CSV output
\item Website
\end{itemize}
}
\section{Tor Metrics Portal setup}
{\it
Write me. In particular, include documentation of deployed ERNIE that
runs the metrics website.
This documentation has two purposes:
First, a reference setup can help others creating their own ERNIE
configuration that goes beyond the use cases as described above.
Second, we need to remember how things are configured anyway, so we can
as well document them here.}
\end{document}