Rewrite documentation.

2024-10-06 23:43:29 +00:00 · 2015-05-16 11:40:25 +02:00 · 2015-05-16 11:40:25 +02:00 · 94c7158834
commit 94c7158834
parent eab1ebc85a
5 changed files with 80 additions and 557 deletions
--- a/.gitignore
+++ b/.gitignore
@ -5,8 +5,6 @@ config
 data/
 in/
 out/
-doc/manual.aux
-doc/manual.log
 log/
 rsync/
 stats/
--- a/INSTALL.md
+++ b/INSTALL.md
@ -0,0 +1,80 @@
+CollecTor -- Operator's Guide
+=============================
+
+Welcome to the Operator's Guide of CollecTor.  This guide explains how
+to set up a new CollecTor instance to download relay descriptors from the
+Tor directory authorities.
+
+
+Requirements
+------------
+
+You'll need a Linux host with at least 50G disk space and 2G RAM.
+
+In the following we'll assume that the host runs Debian stable as
+operating system, but it should work on any other Linux or possibly even
+*BSD.  Though you'll be mostly on your own with those.
+
+
+Prepare the system
+------------------
+
+Create a working directory for CollecTor.  In this guide, we'll assume
+that you're using `/srv/collector.torproject.org/` as working directory,
+but feel free to use another directory that better suits your needs.
+
+$ sudo mkdir -p /srv/collector.torproject.org/
+$ sudo chown vagrant:vagrant /srv/collector.torproject.org/
+
+Install a few packages:
+
+$ sudo apt-get install openjdk-6-jdk ant libcommons-codec-java \
+  libcommons-compress-java
+
+
+Clone the metrics-db repository
+-------------------------------
+
+$ cd /srv/collector.torproject.org/
+$ git clone https://git.torproject.org/metrics-db
+
+
+Clone required submodule metrics-lib
+------------------------------------
+
+$ git submodule init
+$ git submodule update
+
+
+Compile CollecTor
+-----------------
+
+$ ant compile
+
+
+Configure the relay descriptor downloader
+-----------------------------------------
+
+Edit the config file and uncomment and edit at least the following line:
+
+DownloadRelayDescriptors 1
+
+
+Run the relay descriptor downloader
+-----------------------------------
+
+$ bin/run-relaydescs
+
+
+Set up an hourly cronjob for the relay descriptor downloader
+------------------------------------------------------------
+
+Ideally, run the relay descriptor downloader once per hour by adding a
+crontab entry like the following:
+
+6 * * * * cd /srv/collector.torproject.org/db/ && bin/run-relaydescs
+
+Watch out for INFO-level logs in the `log/` directory.  In particular, the
+lines following "Statistics on the completeness of written relay
+descriptors:" is quite important.
+
--- a/7
+++ b/7
@ -1,7 +0,0 @@
-ERNIE is the Enhanced R-based tor Network Intelligence Engine
-         (why ERNIE? because nobody liked BIRT; sorry for misspelling Tor)
-
--------------------------------------------------------------------------
-
-Please find documentation in doc/ .
-
--- a/doc/manual.pdf
+++ b/doc/manual.pdf
--- a/doc/manual.tex
+++ b/doc/manual.tex
@ -1,548 +0,0 @@
-\documentclass{article}
-\begin{document}
-\title{ERNIE: a tool to study the Tor network\\-- User's Guide --}
-\author{by Karsten Loesing \texttt{<karsten@torproject.org>}}
-\maketitle
-
-\section{Overview}
-
-Welcome to ERNIE!
-ERNIE is a tool to study the Tor network.
-ERNIE has been designed to process all kinds of data about the Tor network
-and visualize them or prepare them for further analysis.
-ERNIE is also the software behind the Tor Metrics Portal
-\verb+http://metrics.torproject.org/+.
-
-The acronym ERNIE stands for the \emph{Enhanced R-based tor Network
-Intelligence Engine} (sorry for misspelling Tor).
-Why ERNIE?
-Because nobody liked BIRT (Business Intelligence and Reporting Tools) that
-we used for visualizing statistics about the Tor network before writing
-our own software.
-By the way, reasons were that BIRT made certain people's browsers crash
-and requires JavaScript that most Tor user have turned off.
-
-If you want to learn more about the Tor network, regardless of whether you
-want to present your findings on a website (like ERNIE does) or include
-them in your next Tor paper, this user's guide is for you!
-
-\section{Installation instructions}
-
-ERNIE depends on various other software tools. ERNIE is developed in a
-\emph{Git} repository which is currently the only way to download it.
-ERNIE uses \emph{Java} for parsing data, \emph{R} for plotting graphs,
-and \emph{PostgreSQL} for importing data into a database.
-Which of these tools you need depends on what tasks you are planning to
-use ERNIE for.
-In most cases it is not required to install all these tools.
-For this tutorial, we assume Debian GNU/Linux 5.0 as operating system.
-Installation instructions for other platforms may vary.
-
-\subsection{Git 1.5.6.5}
-
-Currently, the only way to download ERNIE is to clone its Git branch.
-
-Install Git 1.5.6.5 (or higher) and check that it's working:
-
-\begin{verbatim}
-$ sudo apt-get install git-core
-$ git --version
-\end{verbatim}
-
-\subsection{Java 6}
-
-ERNIE requires Java to parse data from various data sources and write them
-to one or more data sinks. Java is required for most use cases of ERNIE.
-
-Add the non-free repository to the apt sources in
-\verb+/etc/apt/sources.list+ by changing the line (mirror URL may vary):
-
-\begin{verbatim}
-deb http://ftp.ca.debian.org/debian/ lenny main
-\end{verbatim}
-
-to
-
-\begin{verbatim}
-deb http://ftp.ca.debian.org/debian/ lenny main non-free
-\end{verbatim}
-
-Fetch the package list, install Sun Java 6, and set it as system default:
-
-\begin{verbatim}
-$ sudo apt-get update
-$ sudo apt-get install sun-java6-jdk
-$ sudo update-alternatives --set java \
-      /usr/lib/jvm/java-6-sun/jre/bin/java
-$ sudo update-alternatives --set javac \
-      /usr/lib/jvm/java-6-sun/bin/javac
-\end{verbatim}
-
-Check that Java 6 is installed and selected as default:
-
-\begin{verbatim}
-$ java -version
-$ javac -version
-\end{verbatim}
-
-\subsection{Ant 1.7}
-
-ERNIE comes with an Ant build file that facilitates common build tasks.
-If you want to use Ant to build and run ERNIE, install Ant and check its
-installed version (tested with 1.7):
-
-\begin{verbatim}
-$ sudo apt-get install ant
-$ ant -version
-\end{verbatim}
-
-\subsection{R 2.8 and ggplot2}
-
-ERNIE uses R and the R library \emph{ggplot2} to visualize anaylsis
-results for presentation on a website or for inclusion in publications.
-ggplot2 requires at least R version 2.8 to be installed.
-
-Add a new line to \verb+/etc/apt/sources.list+:
-
-\begin{verbatim}
-deb http://cran.cnr.berkeley.edu/bin/linux/debian lenny-cran/
-\end{verbatim}
-
-Download the package maintainer's public key (``Johannes Ranke (CRAN Debian
-archive) $<$jranke@uni-bremen.de$>$''):
-
-\begin{verbatim}
-$ gpg --keyserver pgpkeys.pca.dfn.de --recv-key 381BA480
-$ gpg --export 381BA480 | sudo apt-key add -
-\end{verbatim}
-
-Install the most recent R version:
-
-\begin{verbatim}
-$ sudo apt-get update
-$ sudo apt-get -t unstable install r-base
-\end{verbatim}
-
-Start R to check its version (must be 2.8 or higher) and install ggplot2.
-Do this as root, so that the installed package is available to all system
-users:
-
-\begin{verbatim}
-$ sudo R
-> install.packages("ggplot2")
-> q()
-\end{verbatim}
-
-Confirm that R and ggplot2 are installed:
-
-\begin{verbatim}
-$ R
-> library(ggplot2)
-> q()
-\end{verbatim}
-
-\subsection{PostgreSQL 8.3}
-\label{sec-install-postgres}
-
-ERNIE uses PostgreSQL to import data into a database for later analysis.
-This feature is not required for most use cases of ERNIE, but only for
-people who prefer having the network data in a database to execute custom
-queries.
-
-Install PostgreSQL 8.3 using apt-get:
-
-\begin{verbatim}
-$ sudo apt-get install postgresql-8.3
-\end{verbatim}
-
-Create a new database user \verb+ernie+ to insert data and run queries.
-This command is executed as unix user \verb+postgres+ and therefore as
-database superuser \verb+postgres+ via ident authentication. The
-\verb+-P+ flag issues a password prompt for the new user.
-There is no need to give the new user superuser privileges or allow it to
-create databases or new roles.
-
-\begin{verbatim}
-$ sudo -u postgres createuser -P ernie
-\end{verbatim}
-
-Create a new database schema \verb+tordir+ owned by user \verb+ernie+
-(using option \verb+-O+).
-Again, this command is executed as \verb+postgres+ system user to make use
-of ident authentication.
-
-\begin{verbatim}
-$ sudo -u postgres createdb -O ernie tordir
-\end{verbatim}
-
-Log into the database schema as user \verb+ernie+ to check that it's
-working.
-This time, ident authentication is not available, since there is no system
-user \verb+ernie+.
-Instead, we will use password authentication via a TCP connection to
-localhost (using option \verb+-h+) as database user \verb+ernie+ (using
-option \verb+-U+).
-
-\begin{verbatim}
-$ psql -h localhost -U ernie tordir
-tordir=> \q
-\end{verbatim}
-
-\subsection{ERNIE}
-
-Finally, you can install ERNIE by cloning its Git branch: 
-
-\begin{verbatim}
-$ git clone git://git.torproject.org/ernie
-\end{verbatim}
-
-This command should create a directory \verb+ernie/+ which we will
-consider the working directory of ERNIE.
-
-\section{Getting started with ERNIE}
-
-The ERNIE project was started as a simple tool to parse Tor relay
-descriptors and plot graphs on Tor network usage for a website.
-Since then, ERNIE has grown to a tool that can process all kinds of Tor
-network data for various purposes, including but not limited to
-visualization.
-
-We think that the easiest way to get started with ERNIE is to walk through
-typical use cases in a tutorial style and explain what is required to set
-up ERNIE.
-These use cases have been chosen from what we think are typical
-applications of ERNIE.
-
-\subsection{Visualizing network statistics}
-
-{\it Write me.}
-
-\subsection{Importing relay descriptors into a database}
-
-As of February 2010, the relays and directories in the Tor network
-generate more than 1 GB of descriptors every month.
-There are two approaches to process these amounts of data:
-extract only the relevant data for the analysis and write them to files,
-or import all data to a database and run queries on the database.
-ERNIE currently takes the file-based approach for the Metrics Portal,
-which works great for standardized analyses.
-But the more flexible way to research the Tor network is to work with a
-database.
-
-This tutorial describes how to import relay descriptors into a database
-and run a few example queries.
-Note that the presented database schema is limited to answering basic
-questions about the Tor network.
-In order to answer more complex questions, one would have to extend the
-database schema and Java classes which is sketched at the end of this
-tutorial.
-
-\subsubsection{Preparing database for data import}
-
-The first step in importing relay descriptors into a database is to
-install a database management system.
-See Section \ref{sec-install-postgres} for installation instructions of
-PostgreSQL 8.3 on Debian GNU/Linux 5.0.
-Note that in theory, any other relational database that has a working JDBC
-4 driver should work, too, possibly with minor modifications to ERNIE.
-
-Import the database schema from file \verb+db/tordir.sql+ containing two
-tables that we need for importing relay descriptors plus two indexes to
-accelerate queries. Check that tables have been created using \verb+\dt+.
-You should see a list containing the two tables \verb+descriptor+ and
-\verb+statusentry+.
-
-\begin{verbatim}
-$ psql -h localhost -U ernie -f db/tordir.sql tordir
-$ psql -h localhost -U ernie tordir
-tordir=> \dt
-tordir=> \q
-\end{verbatim}
-
-A row in the \verb+statusentry+ table contains the information that a
-given relay (that has published the server descriptor with ID
-\verb+descriptor+) was contained in the network status consensus published
-at time \verb+validafter+.
-These two fields uniquely identify a row in the \verb+statusentry+ table.
-The other fields contain boolean values for the flags that the directory
-authorities assigned to the relay in this consensus, e.g., the Exit flag
-in \verb+isexit+.
-Note that for the 24 network status consensuses of a given day with each
-of them containing 2000 relays, there will be $24 \times 2000$ rows in the
-\verb+statusentry+ table.
-
-The \verb+descriptor+ table contains some portion of the information that
-a relay includes in its server descriptor.
-Descriptors are identified by the \verb+descriptor+ field which
-corresponds to the \verb+descriptor+ field in the \verb+statusentry+
-table.
-The other fields contain further data of the server descriptor that might
-be relevant for analyses, e.g., the platform line with the Tor software
-version and operating system of the relay.
-
-Obviously, this data schema doesn't match everyone's needs.
-See the instructions below for extending ERNIE to import other data into
-the database.
-
-\subsubsection{Downloading relay descriptors from the metrics website}
-
-In the next step you will probably want to download relay descriptors from
-the metrics website
-\verb+http://metrics.torproject.org/data.html#relaydesc+.
-Download the \verb+v3 consensuses+ and/or \verb+server descriptors+ of the
-months you want to analyze.
-The server descriptors are the documents that relays publish at least
-every 18 hours describing their capabilities, whereas the v3 consensuses
-are views of the directory authorities on the available relays at a given
-time.
-For this tutorial you need both v3 consensuses and server descriptors.
-You might want to start with a single month of data, experiment with it,
-and import more data later on.
-Extract the tarballs to a new directory \verb+archives/+ in the ERNIE
-working directory.
-
-\subsubsection{Configuring ERNIE to import relay descriptors into a
-database}
-
-ERNIE can be used to read data from one or more data sources and write
-them to one or more data sinks.
-You need to configure ERNIE so that it knows to use the downloaded relay
-descriptors as data source and the database as data sink.
-Add the following two lines to your \verb+config+ file:
-
-\begin{verbatim}
-ImportDirectoryArchives 1
-WriteRelayDescriptorDatabase 1
-\end{verbatim}
-
-You further need to provide the JDBC string that ERNIE shall use to access
-the database schema \verb+tordir+ that we created above.
-The config option with the JDBC string for a local PostgreSQL database
-might be (without line break):
-
-\begin{verbatim}
-RelayDescriptorDatabaseJDBC
-  jdbc:postgresql://localhost/tordir?user=ernie&password=password
-\end{verbatim}
-
-\subsubsection{Importing relay descriptors using ERNIE}
-
-Now you are ready to actually import relay descriptors using ERNIE.
-Create a directory for Java class files, compile the Java source files,
-and run ERNIE. All these steps are performed by the default target in the
-provided Ant task.
-
-\begin{verbatim}
-$ ant
-\end{verbatim}
-
-Note that the import process might take between a few minutes and an hour,
-depending on your hardware.
-You will notice that ERNIE doesn't write progress messages to the standard
-output, which is useful for unattended installations with only warnings
-being mailed out by cron.
-You can change this behavior and make messages on the standard output more
-verbose by setting
-\verb+java.util.logging.ConsoleHandler.level+ in
-\verb+logging.properties+ to \verb+INFO+ or \verb+FINE+.
-Alternately, you can look at the log file \verb+log.0+ that is created by
-ERNIE.
-
-If ERNIE finishes after a few seconds, you have probably put the relay
-descriptors at the wrong place.
-Make sure that you extract the relay descriptors to sub directories of
-\verb+archives/+ in the ERNIE working directory.
-
-If you interrupt ERNIE, or if ERNIE terminates uncleanly for some reason,
-you will have problems starting it the next time.
-ERNIE uses a local lock file called \verb+lock+ to make sure that only a
-single instance of ERNIE is running at a time.
-If you are sure that the last ERNIE instance isn't running anymore, you
-can delete the lock file and start ERNIE again.
-
-If all goes well, you should now have the relay descriptors of 1 month in
-your database.
-
-\subsubsection{Example queries}
-
-In this tutorial, we want to give you a few examples for using the
-database schema with the imported relay descriptors to extract some useful
-statistics about the Tor network.
-
-In the first example we want to find out how many relays have been running
-on average per day and how many of these relays were exit relays.
-We only need the \verb+statusentry+ table for this evaluation, because
-the information we are interested in is contained in the network status
-consensuses.
-
-The SQL statement that we need for this evaluation consists of two parts:
-First, we find out how many network status consensuses have been published
-on any given day.
-Second, we count all relays and those with the Exit flag and divide these
-numbers by the number of network status consensuses per day.
-
-\begin{verbatim}
-$ psql -h localhost -U ernie tordir
-tordir=> SELECT DATE(validafter),
-    COUNT(*) / relay_statuses_per_day.count AS avg_running,
-    SUM(CASE WHEN isexit IS TRUE THEN 1 ELSE 0 END) /
-      relay_statuses_per_day.count AS avg_exit
-  FROM statusentry,
-    (SELECT COUNT(*) AS count, DATE(validafter) AS date
-      FROM (SELECT DISTINCT validafter FROM statusentry)
-      distinct_consensuses
-      GROUP BY DATE(validafter)) relay_statuses_per_day
-  WHERE DATE(validafter) = relay_statuses_per_day.date
-  GROUP BY DATE(validafter), relay_statuses_per_day.count
-  ORDER BY DATE(validafter);
-tordir=> \q
-\end{verbatim}
-
-Executing this query should finish within a few seconds to one minute,
-again depending on your hardware.
-The result might start like this (truncated here):
-
-\begin{verbatim}
-    date    | avg_running | avg_exit
------------+-------------+----------
- 2010-02-01 |        1583 |      627
- 2010-02-02 |        1596 |      638
- 2010-02-03 |        1600 |      654
-:
-\end{verbatim}
-
-In the second example we want to find out what Tor software versions the
-relays have been running.
-More precisely, we want to know how many relays have been running what Tor
-versions on micro version granularity (e.g., 0.2.2) on average per day?
-
-We need to combine network status consensuses with server descriptors to
-find out this information, because the version information is not
-contained in the consensuses (or at least, it's optional to be contained
-in there; and after all, this is just an example).
-Note that we cannot focus on server descriptors only and leave out the
-consensuses for this analysis, because we want our analysis to be limited
-to running relays as confirmed by the directory authorities and not
-include all descriptors that happened to be published at a given day.
-
-The SQL statement again determines the number of consensuses per day in a
-sub query.
-In the next step, we join the \verb+statusentry+ table with the
-\verb+descriptor+ table for all rows contained in the \verb+statusentry+
-table.
-The left join means that we include \verb+statusentry+ rows even if we do
-not have corresponding lines in the \verb+descriptor+ table.
-We determine the version by skipping the first 4 characters of the platform
-string that should contain \verb+"Tor "+ (without quotes) and cutting off
-after another 5 characters.
-Obviously, this approach is prone to errors if the platform line format
-changes, but it should be sufficient for this example.
-
-\begin{verbatim}
-$ psql -h localhost -U ernie tordir
-tordir=> SELECT DATE(validafter) AS date,
-    SUBSTRING(platform, 5, 5) AS version,
-    COUNT(*) / relay_statuses_per_day.count AS count
-  FROM
-    (SELECT COUNT(*) AS count, DATE(validafter) AS date
-    FROM (SELECT DISTINCT validafter
-      FROM statusentry) distinct_consensuses
-    GROUP BY DATE(validafter)) relay_statuses_per_day
-  JOIN statusentry
-    ON relay_statuses_per_day.date = DATE(validafter)
-  LEFT JOIN descriptor
-    ON statusentry.descriptor = descriptor.descriptor
-  GROUP BY DATE(validafter), SUBSTRING(platform, 5, 5),
-    relay_statuses_per_day.count, relay_statuses_per_day.date
-  ORDER BY DATE(validafter), SUBSTRING(platform, 5, 5);
-tordir=> \q
-\end{verbatim}
-
-Running this query takes longer than the first query, which can be a few
-minutes to half an hour.
-The main reason is that joining the two tables is an expensive database
-operation.
-If you plan to perform many evaluations like this one, you might want to
-create a third table that holds the results of joining the two tables of
-this tutorial.
-Creating such a table to speed up queries is not specific to ERNIE and
-beyond the scope of this tutorial.
-
-The (truncated) result of the query might look like this:
-
-\begin{verbatim}
-    date    | version | count
------------+---------+-------
- 2010-02-01 | 0.1.2   |    10
- 2010-02-01 | 0.2.0   |   217
- 2010-02-01 | 0.2.1   |   774
- 2010-02-01 | 0.2.2   |    75
- 2010-02-01 |         |   505
- 2010-02-02 | 0.1.2   |    14
- 2010-02-02 | 0.2.0   |   328
- 2010-02-02 | 0.2.1   |  1143
- 2010-02-02 | 0.2.2   |   110
-:
-\end{verbatim}
-
-Note that, in the fifth line, we are missing the server descriptors of 505
-relays contained in network status consensuses published on 2010-02-01.
-If you want to avoid such missing values, you'll have to import the server
-descriptors of the previous month, too.
-
-\subsubsection{Extending ERNIE to import further data into the database}
-
-In this tutorial we have explained how to prepare a database, download
-relay descriptors, configure ERNIE, import the descriptors, and execute
-example queries.
-This description is limited to a few examples by the very nature of a
-tutorial.
-If you want to extend ERNIE to import further data into your database,
-you will have to perform at least two steps:
-extend the database schema and modify the Java classes used for parsing.
-
-The first step, extending the database schema, is not specific to ERNIE.
-Just add the fields and tables to the schema definition.
-
-The second step, modifying the Java classes used for parsing, is of course
-specific to ERNIE.
-You will have to look at two classes in particular:
-The first class, \verb+RelayDescriptorDatabaseImporter+, contains the
-prepared statements and methods used to add network status consensus
-entries and server descriptors to the database.
-The second class, \verb+RelayDescriptorParser+, contains the parsing logic
-for the relay descriptors and decides what information to add to the
-database, among other things.
-
-This ends the tutorial on importing relay descriptors into a database.
-Happy researching!
-
-\subsection{Aggregating relay and bridge descriptors}
-
-{\it Write me.}
-
-\section{Software architecture}
-
-{\it Write me. In particular, include overview of components:
-
-\begin{itemize}
-\item Data sources and data sinks
-\item Java classes with data sources and data sinks
-\item R scripts to process CSV output
-\item Website
-\end{itemize}
-}
-
-\section{Tor Metrics Portal setup}
-
-{\it
-Write me. In particular, include documentation of deployed ERNIE that
-runs the metrics website.
-This documentation has two purposes:
-First, a reference setup can help others creating their own ERNIE
-configuration that goes beyond the use cases as described above.
-Second, we need to remember how things are configured anyway, so we can
-as well document them here.}
-
-\end{document}
-