Mirror of https://github.com/torproject/collector.git
Synced 2024-10-06 23:43:29 +00:00

Commit 94c7158834 (parent eab1ebc85a): Rewrite documentation.
2  .gitignore (vendored)
@@ -5,8 +5,6 @@ config
 data/
 in/
 out/
-doc/manual.aux
-doc/manual.log
 log/
 rsync/
 stats/
80  INSTALL.md (new file)

@@ -0,0 +1,80 @@
CollecTor -- Operator's Guide
=============================

Welcome to the Operator's Guide for CollecTor. This guide explains how
to set up a new CollecTor instance that downloads relay descriptors from
the Tor directory authorities.


Requirements
------------

You'll need a Linux host with at least 50 GB of disk space and 2 GB of
RAM.

In the following we'll assume that the host runs Debian stable, but
CollecTor should work on any other Linux distribution and possibly even
*BSD, though you'll be mostly on your own there.


Prepare the system
------------------

Create a working directory for CollecTor. In this guide, we'll assume
that you're using `/srv/collector.torproject.org/` as working directory,
but feel free to use another directory that better suits your needs.

    $ sudo mkdir -p /srv/collector.torproject.org/
    $ sudo chown vagrant:vagrant /srv/collector.torproject.org/

Install a few packages:

    $ sudo apt-get install openjdk-6-jdk ant libcommons-codec-java \
        libcommons-compress-java


Clone the metrics-db repository
-------------------------------

    $ cd /srv/collector.torproject.org/
    $ git clone https://git.torproject.org/metrics-db


Clone the required submodule metrics-lib
----------------------------------------

    $ git submodule init
    $ git submodule update


Compile CollecTor
-----------------

    $ ant compile


Configure the relay descriptor downloader
-----------------------------------------

Edit the config file, uncommenting and editing at least the following
line:

    DownloadRelayDescriptors 1


Run the relay descriptor downloader
-----------------------------------

    $ bin/run-relaydescs


Set up an hourly cronjob for the relay descriptor downloader
------------------------------------------------------------

Ideally, run the relay descriptor downloader once per hour by adding a
crontab entry like the following:

    6 * * * * cd /srv/collector.torproject.org/db/ && bin/run-relaydescs

Watch out for INFO-level logs in the `log/` directory. In particular,
the lines following "Statistics on the completeness of written relay
descriptors:" are quite important.
7  README (deleted)

@@ -1,7 +0,0 @@
ERNIE is the Enhanced R-based tor Network Intelligence Engine
(why ERNIE? because nobody liked BIRT; sorry for misspelling Tor)

--------------------------------------------------------------------------

Please find documentation in doc/ .
BIN  doc/manual.pdf (binary file not shown)
548  doc/manual.tex (deleted)

@@ -1,548 +0,0 @@
\documentclass{article}
\begin{document}
\title{ERNIE: a tool to study the Tor network\\-- User's Guide --}
\author{by Karsten Loesing \texttt{<karsten@torproject.org>}}
\maketitle

\section{Overview}

Welcome to ERNIE!
ERNIE is a tool to study the Tor network.
ERNIE has been designed to process all kinds of data about the Tor network
and to visualize them or prepare them for further analysis.
ERNIE is also the software behind the Tor Metrics Portal at
\verb+http://metrics.torproject.org/+.

The acronym ERNIE stands for the \emph{Enhanced R-based tor Network
Intelligence Engine} (sorry for misspelling Tor).
Why ERNIE?
Because nobody liked BIRT (Business Intelligence and Reporting Tools),
which we used for visualizing statistics about the Tor network before
writing our own software.
Among the reasons were that BIRT made certain people's browsers crash
and that it requires JavaScript, which most Tor users have turned off.

If you want to learn more about the Tor network, regardless of whether you
want to present your findings on a website (like ERNIE does) or include
them in your next Tor paper, this user's guide is for you!

\section{Installation instructions}

ERNIE depends on various other software tools. ERNIE is developed in a
\emph{Git} repository, which is currently the only way to download it.
ERNIE uses \emph{Java} for parsing data, \emph{R} for plotting graphs,
and \emph{PostgreSQL} for importing data into a database.
Which of these tools you need depends on what tasks you are planning to
use ERNIE for.
In most cases it is not required to install all of these tools.
For this tutorial, we assume Debian GNU/Linux 5.0 as operating system.
Installation instructions for other platforms may vary.

\subsection{Git 1.5.6.5}

Currently, the only way to download ERNIE is to clone its Git branch.

Install Git 1.5.6.5 (or higher) and check that it's working:

\begin{verbatim}
$ sudo apt-get install git-core
$ git --version
\end{verbatim}

\subsection{Java 6}

ERNIE requires Java to parse data from various data sources and write them
to one or more data sinks. Java is required for most use cases of ERNIE.

Add the non-free repository to the apt sources in
\verb+/etc/apt/sources.list+ by changing the line (mirror URL may vary):

\begin{verbatim}
deb http://ftp.ca.debian.org/debian/ lenny main
\end{verbatim}

to

\begin{verbatim}
deb http://ftp.ca.debian.org/debian/ lenny main non-free
\end{verbatim}

Fetch the package list, install Sun Java 6, and set it as system default:

\begin{verbatim}
$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk
$ sudo update-alternatives --set java \
    /usr/lib/jvm/java-6-sun/jre/bin/java
$ sudo update-alternatives --set javac \
    /usr/lib/jvm/java-6-sun/bin/javac
\end{verbatim}

Check that Java 6 is installed and selected as default:

\begin{verbatim}
$ java -version
$ javac -version
\end{verbatim}

\subsection{Ant 1.7}

ERNIE comes with an Ant build file that facilitates common build tasks.
If you want to use Ant to build and run ERNIE, install Ant and check its
installed version (tested with 1.7):

\begin{verbatim}
$ sudo apt-get install ant
$ ant -version
\end{verbatim}

\subsection{R 2.8 and ggplot2}

ERNIE uses R and the R library \emph{ggplot2} to visualize analysis
results for presentation on a website or for inclusion in publications.
ggplot2 requires at least R version 2.8 to be installed.

Add a new line to \verb+/etc/apt/sources.list+:

\begin{verbatim}
deb http://cran.cnr.berkeley.edu/bin/linux/debian lenny-cran/
\end{verbatim}

Download the package maintainer's public key (``Johannes Ranke (CRAN Debian
archive) $<$jranke@uni-bremen.de$>$''):

\begin{verbatim}
$ gpg --keyserver pgpkeys.pca.dfn.de --recv-key 381BA480
$ gpg --export 381BA480 | sudo apt-key add -
\end{verbatim}

Install the most recent R version:

\begin{verbatim}
$ sudo apt-get update
$ sudo apt-get -t unstable install r-base
\end{verbatim}

Start R to check its version (must be 2.8 or higher) and install ggplot2.
Do this as root, so that the installed package is available to all system
users:

\begin{verbatim}
$ sudo R
> install.packages("ggplot2")
> q()
\end{verbatim}

Confirm that R and ggplot2 are installed:

\begin{verbatim}
$ R
> library(ggplot2)
> q()
\end{verbatim}

\subsection{PostgreSQL 8.3}
\label{sec-install-postgres}

ERNIE uses PostgreSQL to import data into a database for later analysis.
This feature is not required for most use cases of ERNIE, but only for
people who prefer having the network data in a database to execute custom
queries.

Install PostgreSQL 8.3 using apt-get:

\begin{verbatim}
$ sudo apt-get install postgresql-8.3
\end{verbatim}

Create a new database user \verb+ernie+ to insert data and run queries.
This command is executed as unix user \verb+postgres+ and therefore as
database superuser \verb+postgres+ via ident authentication. The
\verb+-P+ flag issues a password prompt for the new user.
There is no need to give the new user superuser privileges or to allow it
to create databases or new roles.

\begin{verbatim}
$ sudo -u postgres createuser -P ernie
\end{verbatim}

Create a new database \verb+tordir+ owned by user \verb+ernie+
(using option \verb+-O+).
Again, this command is executed as the \verb+postgres+ system user to make
use of ident authentication.

\begin{verbatim}
$ sudo -u postgres createdb -O ernie tordir
\end{verbatim}

Log into the database as user \verb+ernie+ to check that it's working.
This time, ident authentication is not available, since there is no system
user \verb+ernie+.
Instead, we use password authentication via a TCP connection to localhost
(using option \verb+-h+) as database user \verb+ernie+ (using option
\verb+-U+).

\begin{verbatim}
$ psql -h localhost -U ernie tordir
tordir=> \q
\end{verbatim}

\subsection{ERNIE}

Finally, you can install ERNIE by cloning its Git branch:

\begin{verbatim}
$ git clone git://git.torproject.org/ernie
\end{verbatim}

This command should create a directory \verb+ernie/+, which we will
consider the working directory of ERNIE.

\section{Getting started with ERNIE}

The ERNIE project started as a simple tool to parse Tor relay
descriptors and plot graphs of Tor network usage for a website.
Since then, ERNIE has grown into a tool that can process all kinds of Tor
network data for various purposes, including but not limited to
visualization.

We think that the easiest way to get started with ERNIE is to walk through
typical use cases in a tutorial style and explain what is required to set
up ERNIE.
These use cases have been chosen from what we think are typical
applications of ERNIE.

\subsection{Visualizing network statistics}

{\it Write me.}

\subsection{Importing relay descriptors into a database}

As of February 2010, the relays and directories in the Tor network
generate more than 1~GB of descriptors every month.
There are two approaches to processing this amount of data:
extract only the data relevant for the analysis and write them to files,
or import all data into a database and run queries on the database.
ERNIE currently takes the file-based approach for the Metrics Portal,
which works great for standardized analyses.
But the more flexible way to research the Tor network is to work with a
database.

This tutorial describes how to import relay descriptors into a database
and run a few example queries.
Note that the presented database schema is limited to answering basic
questions about the Tor network.
In order to answer more complex questions, one would have to extend the
database schema and the Java classes, which is sketched at the end of this
tutorial.

\subsubsection{Preparing the database for data import}

The first step in importing relay descriptors into a database is to
install a database management system.
See Section~\ref{sec-install-postgres} for installation instructions for
PostgreSQL 8.3 on Debian GNU/Linux 5.0.
Note that in theory any other relational database with a working JDBC~4
driver should work, too, possibly with minor modifications to ERNIE.

Import the database schema from the file \verb+db/tordir.sql+, which
contains the two tables that we need for importing relay descriptors plus
two indexes to accelerate queries. Check that the tables have been created
using \verb+\dt+.
You should see a list containing the two tables \verb+descriptor+ and
\verb+statusentry+.

\begin{verbatim}
$ psql -h localhost -U ernie -f db/tordir.sql tordir
$ psql -h localhost -U ernie tordir
tordir=> \dt
tordir=> \q
\end{verbatim}

A row in the \verb+statusentry+ table contains the information that a
given relay (one that has published the server descriptor with ID
\verb+descriptor+) was contained in the network status consensus published
at time \verb+validafter+.
These two fields uniquely identify a row in the \verb+statusentry+ table.
The other fields contain boolean values for the flags that the directory
authorities assigned to the relay in this consensus, e.g., the Exit flag
in \verb+isexit+.
Note that for the 24 network status consensuses of a given day, each of
them containing 2000 relays, there will be $24 \times 2000$ rows in the
\verb+statusentry+ table.

The \verb+descriptor+ table contains some portion of the information that
a relay includes in its server descriptor.
Descriptors are identified by the \verb+descriptor+ field, which
corresponds to the \verb+descriptor+ field in the \verb+statusentry+
table.
The other fields contain further data from the server descriptor that
might be relevant for analyses, e.g., the platform line with the Tor
software version and operating system of the relay.

Obviously, this data schema doesn't match everyone's needs.
See the instructions below for extending ERNIE to import other data into
the database.
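
If you want to see the exact field definitions of the two tables, you can
describe them in a \verb+psql+ session (a quick check; the authoritative
definitions are in \verb+db/tordir.sql+):

\begin{verbatim}
$ psql -h localhost -U ernie tordir
tordir=> \d statusentry
tordir=> \d descriptor
tordir=> \q
\end{verbatim}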

\subsubsection{Downloading relay descriptors from the metrics website}

In the next step you will probably want to download relay descriptors from
the metrics website at
\verb+http://metrics.torproject.org/data.html#relaydesc+.
Download the \verb+v3 consensuses+ and/or \verb+server descriptors+ of the
months you want to analyze.
The server descriptors are the documents that relays publish at least
every 18 hours describing their capabilities, whereas the v3 consensuses
are the views of the directory authorities on the available relays at a
given time.
For this tutorial you need both v3 consensuses and server descriptors.
You might want to start with a single month of data, experiment with it,
and import more data later on.
Extract the tarballs to a new directory \verb+archives/+ in the ERNIE
working directory.
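
For example, assuming you downloaded the February 2010 tarballs (the
exact file names may differ from the ones shown here), the extraction
step might look like this:

\begin{verbatim}
$ mkdir archives
$ tar -C archives -xjf consensuses-2010-02.tar.bz2
$ tar -C archives -xjf server-descriptors-2010-02.tar.bz2
\end{verbatim}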

\subsubsection{Configuring ERNIE to import relay descriptors into a
database}

ERNIE can be used to read data from one or more data sources and write
them to one or more data sinks.
You need to configure ERNIE so that it knows to use the downloaded relay
descriptors as data source and the database as data sink.
Add the following two lines to your \verb+config+ file:

\begin{verbatim}
ImportDirectoryArchives 1
WriteRelayDescriptorDatabase 1
\end{verbatim}

You further need to provide the JDBC string that ERNIE shall use to access
the database \verb+tordir+ that we created above.
The config option with the JDBC string for a local PostgreSQL database
might be (without line break):

\begin{verbatim}
RelayDescriptorDatabaseJDBC
jdbc:postgresql://localhost/tordir?user=ernie&password=password
\end{verbatim}

\subsubsection{Importing relay descriptors using ERNIE}

Now you are ready to actually import relay descriptors using ERNIE.
Create a directory for Java class files, compile the Java source files,
and run ERNIE. All of these steps are performed by the default target of
the provided Ant build file.

\begin{verbatim}
$ ant
\end{verbatim}

Note that the import process might take between a few minutes and an hour,
depending on your hardware.
You will notice that ERNIE doesn't write progress messages to the standard
output, which is useful for unattended installations where only warnings
are mailed out by cron.
You can change this behavior and make the messages on the standard output
more verbose by setting
\verb+java.util.logging.ConsoleHandler.level+ in
\verb+logging.properties+ to \verb+INFO+ or \verb+FINE+.
Alternatively, you can look at the log file \verb+log.0+ that is created
by ERNIE.

If ERNIE finishes after a few seconds, you have probably put the relay
descriptors in the wrong place.
Make sure that you extract the relay descriptors to subdirectories of
\verb+archives/+ in the ERNIE working directory.

If you interrupt ERNIE, or if ERNIE terminates uncleanly for some reason,
you will have problems starting it the next time.
ERNIE uses a local lock file called \verb+lock+ to make sure that only a
single instance of ERNIE is running at a time.
If you are sure that the last ERNIE instance isn't running anymore, you
can delete the lock file and start ERNIE again.

If all goes well, you should now have one month of relay descriptors in
your database.

\subsubsection{Example queries}

In this tutorial, we want to give you a few examples of using the
database schema with the imported relay descriptors to extract some useful
statistics about the Tor network.

In the first example we want to find out how many relays have been running
on average per day and how many of these relays were exit relays.
We only need the \verb+statusentry+ table for this evaluation, because
the information we are interested in is contained in the network status
consensuses.

The SQL statement that we need for this evaluation consists of two parts:
First, we find out how many network status consensuses have been published
on any given day.
Second, we count all relays and those with the Exit flag and divide these
numbers by the number of network status consensuses per day.

\begin{verbatim}
$ psql -h localhost -U ernie tordir
tordir=> SELECT DATE(validafter),
           COUNT(*) / relay_statuses_per_day.count AS avg_running,
           SUM(CASE WHEN isexit IS TRUE THEN 1 ELSE 0 END) /
             relay_statuses_per_day.count AS avg_exit
         FROM statusentry,
           (SELECT COUNT(*) AS count, DATE(validafter) AS date
            FROM (SELECT DISTINCT validafter FROM statusentry)
              distinct_consensuses
            GROUP BY DATE(validafter)) relay_statuses_per_day
         WHERE DATE(validafter) = relay_statuses_per_day.date
         GROUP BY DATE(validafter), relay_statuses_per_day.count
         ORDER BY DATE(validafter);
tordir=> \q
\end{verbatim}

Executing this query should finish within a few seconds to one minute,
again depending on your hardware.
The result might start like this (truncated here):

\begin{verbatim}
    date    | avg_running | avg_exit
------------+-------------+----------
 2010-02-01 |        1583 |      627
 2010-02-02 |        1596 |      638
 2010-02-03 |        1600 |      654
 :
\end{verbatim}

In the second example we want to find out which Tor software versions the
relays have been running.
More precisely, we want to know how many relays have been running which
Tor version, at micro version granularity (e.g., 0.2.2), on average per
day.

We need to combine network status consensuses with server descriptors to
find out this information, because the version information is not
contained in the consensuses (or at least, it's optional to be contained
in there; and after all, this is just an example).
Note that we cannot focus on server descriptors only and leave out the
consensuses for this analysis, because we want our analysis to be limited
to running relays as confirmed by the directory authorities, and not to
include all descriptors that happened to be published on a given day.

The SQL statement again determines the number of consensuses per day in a
subquery.
In the next step, we join the \verb+statusentry+ table with the
\verb+descriptor+ table for all rows contained in the \verb+statusentry+
table.
The left join means that we include \verb+statusentry+ rows even if we do
not have corresponding rows in the \verb+descriptor+ table.
We determine the version by skipping the first 4 characters of the
platform string, which should contain \verb+"Tor "+ (without quotes), and
cutting off after another 5 characters.
Obviously, this approach is prone to errors if the platform line format
changes, but it should be sufficient for this example.

\begin{verbatim}
$ psql -h localhost -U ernie tordir
tordir=> SELECT DATE(validafter) AS date,
           SUBSTRING(platform, 5, 5) AS version,
           COUNT(*) / relay_statuses_per_day.count AS count
         FROM
           (SELECT COUNT(*) AS count, DATE(validafter) AS date
            FROM (SELECT DISTINCT validafter
                  FROM statusentry) distinct_consensuses
            GROUP BY DATE(validafter)) relay_statuses_per_day
         JOIN statusentry
           ON relay_statuses_per_day.date = DATE(validafter)
         LEFT JOIN descriptor
           ON statusentry.descriptor = descriptor.descriptor
         GROUP BY DATE(validafter), SUBSTRING(platform, 5, 5),
           relay_statuses_per_day.count, relay_statuses_per_day.date
         ORDER BY DATE(validafter), SUBSTRING(platform, 5, 5);
tordir=> \q
\end{verbatim}

Running this query takes longer than the first one, between a few
minutes and half an hour.
The main reason is that joining the two tables is an expensive database
operation.
If you plan to perform many evaluations like this one, you might want to
create a third table that holds the results of joining the two tables of
this tutorial.
Creating such a table to speed up queries is not specific to ERNIE and
beyond the scope of this tutorial.
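
As a rough sketch (the table and column names here are made up for
illustration and are not part of ERNIE), such a table could be created
once and then queried instead of repeating the join:

\begin{verbatim}
tordir=> CREATE TABLE statusentry_descriptor AS
           SELECT statusentry.validafter, statusentry.isexit,
                  descriptor.platform
           FROM statusentry
           LEFT JOIN descriptor
             ON statusentry.descriptor = descriptor.descriptor;
\end{verbatim}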

The (truncated) result of the query might look like this:

\begin{verbatim}
    date    | version | count
------------+---------+-------
 2010-02-01 | 0.1.2   |    10
 2010-02-01 | 0.2.0   |   217
 2010-02-01 | 0.2.1   |   774
 2010-02-01 | 0.2.2   |    75
 2010-02-01 |         |   505
 2010-02-02 | 0.1.2   |    14
 2010-02-02 | 0.2.0   |   328
 2010-02-02 | 0.2.1   |  1143
 2010-02-02 | 0.2.2   |   110
 :
\end{verbatim}

Note that, in the fifth line, we are missing the server descriptors of 505
relays contained in network status consensuses published on 2010-02-01.
If you want to avoid such missing values, you'll have to import the server
descriptors of the previous month, too.

\subsubsection{Extending ERNIE to import further data into the database}

In this tutorial we have explained how to prepare a database, download
relay descriptors, configure ERNIE, import the descriptors, and execute
example queries.
This description is limited to a few examples by the very nature of a
tutorial.
If you want to extend ERNIE to import further data into your database,
you will have to perform at least two steps:
extend the database schema and modify the Java classes used for parsing.

The first step, extending the database schema, is not specific to ERNIE.
Just add the fields and tables to the schema definition.

The second step, modifying the Java classes used for parsing, is of course
specific to ERNIE.
You will have to look at two classes in particular:
The first class, \verb+RelayDescriptorDatabaseImporter+, contains the
prepared statements and methods used to add network status consensus
entries and server descriptors to the database.
The second class, \verb+RelayDescriptorParser+, contains the parsing logic
for the relay descriptors and decides, among other things, what
information to add to the database.

This ends the tutorial on importing relay descriptors into a database.
Happy researching!

\subsection{Aggregating relay and bridge descriptors}

{\it Write me.}

\section{Software architecture}

{\it Write me. In particular, include an overview of components:

\begin{itemize}
\item Data sources and data sinks
\item Java classes with data sources and data sinks
\item R scripts to process CSV output
\item Website
\end{itemize}
}

\section{Tor Metrics Portal setup}

{\it
Write me. In particular, include documentation of the deployed ERNIE that
runs the metrics website.
This documentation has two purposes:
First, a reference setup can help others create their own ERNIE
configuration that goes beyond the use cases described above.
Second, we need to remember how things are configured anyway, so we might
as well document them here.}

\end{document}