Expand INSTALL.md.

Contains many suggestions by iwakeh.

Implements #20380.
Karsten Loesing 2016-10-17 16:56:15 +02:00
parent 3081583d9f
commit 0ac703be63
3 changed files with 253 additions and 129 deletions


@ -1,90 +1,281 @@
CollecTor -- Operator's Guide
=============================
# CollecTor Operator's Guide
Welcome to the Operator's Guide of CollecTor. This guide explains how
to set up a new CollecTor instance to download relay descriptors from the
Tor directory authorities.
Welcome to CollecTor, your friendly data-collecting service in the Tor network.
CollecTor fetches data from various nodes and services in the public Tor network
and makes it available to the world. This data includes relay descriptors from
the directory authorities, sanitized bridge descriptors from the bridge
authority, and other data about the Tor network.
This document describes how to set up your very own CollecTor instance. It was
written with an audience in mind that has at least some experience with running
services and is comfortable with the command line. It's not required that you
know how to read or even write Java code, though.
Before we go ahead with setting up your CollecTor instance, let us pause for a
moment and reflect why you'd want to do that as opposed to simply using data
from an existing CollecTor instance.
CollecTor is a service, and the best reason for running a CollecTor service
instance is to offer your collected Tor network data to others. You could
mirror the data from an existing instance or even aggregate data from multiple
instances by using the synchronization feature. Or you could fetch data from
public sources and provide your data to users and other CollecTor instances.
Another reason might be to collect or synchronize Tor network data and provide
it to your working or research group. And of course you might want to run a
CollecTor instance for testing purposes. In all these cases, setting up a
CollecTor instance might make sense.
However, if you only want to use Tor network data as a client, even as input for
another service you're developing, you don't have to and probably shouldn't run
a CollecTor instance. In that case it's sufficient to use a library like
[metrics-lib](https://dist.torproject.org/descriptor/) or
[Stem](https://stem.torproject.org/) to fetch CollecTor data and process it.
Requirements
------------
## Setting up the host
You'll need a Linux host with at least 50G disk space and 2G RAM.
You'll need a host with at least 200G disk space and 4G RAM.
In the following we'll assume that the host runs Debian stable as
operating system, but it should work on any other Linux or possibly even
*BSD. Though you'll be mostly on your own with those.
In the following we'll assume that your host runs Debian stable as operating
system. CollecTor should run on any other Linux or possibly even *BSD, though
you'll be mostly on your own with those. And as Java is available on a variety
of other operating systems, those might work, too, but, again, you'll be on your
own.
As Java is available on a variety of other operating systems, these might
work, too. But again you'll be on your own.
CollecTor does not require installing many or specific dependencies on the host
system. All it needs is a Java Runtime Environment version 7 or higher and an
Apache HTTP Server version 2 or higher.
Prepare the system
------------------
The CollecTor service runs entirely under a non-privileged user account. Any
user account will do, but feel free to create a new user account just for the
CollecTor service, if you prefer.
CollecTor is provided by The Tor Project and can be found here:
https://dist.torproject.org/collector/
Download the tar.gz file with the version number listed in build.xml.
The README inside the tar.gz file has all the information about CollecTor
and explains how to verify the downloaded files.
The CollecTor service requires running in a working directory where it can store
Tor network data and state files. This working directory can be located
anywhere in the file system as long as there is enough disk space available.
The Apache service will later need to know where to find files to serve to web
clients including other CollecTor instances.
You need a Java installation. On Debian you can just run:
CollecTor does not require setting up a database.
$ sudo apt-get install openjdk-7-jdk
This concludes the host setup. Later in the process you'll once more need root
privileges to configure Apache to serve CollecTor files. But until then you can
do all setup steps with the non-privileged user account.
Configure the relay descriptor downloader
-----------------------------------------
Run
$ java -DLOGBASE=/path/to/logs -jar collector-<version>.jar
once in order to obtain a configuration properties file.
## Setting up the service
There are quite a few options to set in collector.properties and the comments
explain their meaning. So, you can set the options to the values you want.
### Obtaining the code
Create the paths you set in collector.properties.
CollecTor releases are available at:
Example: run the relay descriptor downloader
--------------------------------------------
```https://dist.torproject.org/collector/```
This is a small example about how CollecTor is used. All the other
settings are explained in the default collector.properties.
Choose the latest tarball and signature file, verify the signature on the
tarball, and extract the tarball in a location of your choice which will create
a subdirectory called `collector-<version>/`.
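As a rough sketch, the download, verification, and extraction steps could look like this. It assumes that release files follow the per-version layout of the 1.0.0 release and that the signer's public key is already in your GnuPG keyring. `VERSION` and the helper function names are placeholders, not part of CollecTor.

```shell
# Sketch only: VERSION and the URL layout are assumptions based on the 1.0.0 release.
VERSION="${VERSION:-1.0.0}"
BASE_URL="https://dist.torproject.org/collector"

# Print the URL of the release tarball for the given version.
release_url() {
  echo "$BASE_URL/$1/collector-$1.tar.gz"
}

# Download tarball and detached signature, verify the signature, then extract.
# Extraction only happens if the verification succeeds.
fetch_and_verify() {
  url=$(release_url "$1")
  wget "$url" "$url.asc" &&
    gpg --verify "collector-$1.tar.gz.asc" "collector-$1.tar.gz" &&
    tar xzf "collector-$1.tar.gz"
}

# Uncomment to run:
# fetch_and_verify "$VERSION"
```

A successful `gpg --verify` only tells you that the signature was made with that key; whether you trust the key is a separate question.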
For running the relay descriptor downloader:
Edit collector.properties and set at least the following value to true:
### Planning the service setup
DownloadRelayDescriptors = true
By default, CollecTor is configured to do nothing at all. The reason is that
new operators should first understand its capabilities and make a plan for
configuring their new CollecTor instance. Let's do that now.
$ java -DLOGBASE=/path/to/logs -jar collector-<version>.jar </place/of/collector.properties>
CollecTor consists of a background updater with an internal scheduler and
several data-collecting modules that write data to local directories which are
then served by a webserver. Each of the modules can have one or more data
sources, some public like relay descriptors served by the directory authorities
and some private like bridge descriptors uploaded to the bridge directory
authority.
Watch out for INFO-level logs in the log directory you configured. In
particular, the lines following "Statistics on the completeness of written
relay descriptors:" are quite important.
You'll have to decide which of the data-collecting modules you want to activate,
how often to execute these modules, and which data sources to collect data from.
In case of unforeseen problems, ERROR- and WARN-level logs should help you
troubleshoot your installation.
The release tarball contains an executable .jar file:
Maintenance
-----------
```collector-<version>/generated/dist/collector-<version>.jar```
CollecTor is designed to keep running and attempts to re-run modules even
when previous runs stopped because of a problem. Thus, it is very important
to watch out for WARNING level and especially ERROR level log statements.
Copy this .jar file into the working directory and run it:
These often will point to problems you can do something about, e.g. a full disk
or missing file system permissions.
```java -jar collector-<version>.jar```
Logging Configuration
---------------------
CollecTor will print some text about not being able to find a configuration
file, which is understandable since there is no such file yet. It also writes a
fresh configuration file called `collector.properties` to the working directory
which contains defaults (that instruct CollecTor to do nothing).
Some hints for those who are familiar with Logback:
Read through that file to learn about all available configuration options.
If you want to use your own logging configuration for Logback you can simply
create your own logback.xml or logback.groovy and start CollecTor in the
following way:
java -cp /folder/with/logback:collector-1.0.0.jar org.torproject.collector.Main
</place/of/collector.properties>
### Performing the initial run
When you have made a plan for how to configure your CollecTor instance, edit the
`collector.properties` file, set it to run only once, activate all relevant
modules, check and possibly edit other options as needed, and save the file.
Run the Java process using:
```java -Xmx2g -DLOGBASE=<your-log-dir> -jar collector-<version>.jar
<your-collector.properties>```
The option `-Xmx2g` sets the maximum heap space to 2G, which is based on the
recommended 4G total RAM size for the host. If you have more memory to spare,
feel free to adapt this option as needed.
This may take a while, depending on which modules you activated. Read the logs
to learn if the run was successful. If it wasn't, go back to editing the
properties file and re-run the .jar file. Change the run-once option back when
you're done with the initial run of the Java process.
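One way to check the logs after a run is to search them for warnings and errors. The following is a minimal sketch that assumes log lines contain the level as a separate word such as `WARN` or `ERROR`. `LOG_DIR` and the function name are placeholders for illustration.

```shell
# Sketch only: LOG_DIR is hypothetical; use the directory you passed as -DLOGBASE.
LOG_DIR="${LOG_DIR:-./logs}"

# Print every line containing the word WARN or ERROR found below the given
# directory, prefixed by the name of the file it came from.
log_problems() {
  grep -rHwE 'WARN|ERROR' "$1"
}

# Uncomment to run; no output means no warnings or errors were logged:
# log_problems "$LOG_DIR"
```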
Complete the initialization step by copying the shell script
`collector-<version>/src/main/resources/create-tarballs.sh` from the release
tarball to the working directory or another location of your choice, editing the
contained paths, and executing it. Note that this script will at least partly
fail if one or more modules are deactivated.
### Scheduling periodic runs
The next step in setting up the CollecTor instance is to start the updater with
its internal scheduler and let it run continuously in the background. In order
to do so, make sure the run-once property is set to `false`, possibly adapt the
scheduling properties, and execute the .jar file using the same command as above
but this time in the background. Make sure that the same command will be run
automatically after a reboot.
Also make sure that the `create-tarballs.sh` script will be executed at least
every three days, but no more than once per day.
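The following sketch shows one possible way to start the updater in the background and to schedule both tasks with cron. All paths, the version number, the PID file name, and the cron times are placeholders chosen for illustration. The `@reboot` entry assumes a cron implementation that supports it, as the one in Debian does.

```shell
# Sketch only: every value below is a placeholder, not a CollecTor default.
WORK_DIR="${WORK_DIR:-$HOME/collector}"
LOG_DIR="${LOG_DIR:-$WORK_DIR/logs}"
JAR="${JAR:-collector-1.0.0.jar}"
PROPS="${PROPS:-collector.properties}"
JAVA="${JAVA:-java}"

# Start the updater in the background, detached from the terminal, and
# remember its process ID in a file inside the working directory.
start_collector() {
  cd "$WORK_DIR" || return 1
  mkdir -p "$LOG_DIR" || return 1
  nohup "$JAVA" -Xmx2g -DLOGBASE="$LOG_DIR" -jar "$JAR" "$PROPS" >/dev/null 2>&1 &
  echo $! > collector.pid
}

# Uncomment when saving this as a script:
# start_collector

# Possible crontab entries for the non-privileged user (edit with crontab -e):
#   @reboot       /path/to/this-script.sh
#   30 3 */2 * *  cd /path/to/workdir && ./create-tarballs.sh
```

The second entry runs the tarball script every other day of the month, which stays within the limits of at most once per day and at least every three days.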
### Setting up the website
The last remaining part in the setup process is to make the collected data
available. Copy the contents from `collector-<version>/src/main/webapp/*` in
the release tarball to a web application subdirectory in the working directory
or another location of your choice.
Configure an Apache site that uses redirects or symbolic links to serve the
following directories or files in your working directory (where paths in <>
refer to settings in `collector.properties`):
* `<your-webapp-dir>/*`,
* `<ArchivePath>`,
* `<IndexPath>`, and
* `<RecentPath>`.
Use your browser to make sure that your instance serves the web pages and data
that you'd expect.
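If you go with symbolic links, a sketch like the following could create them. It assumes that Apache is allowed to follow symbolic links in the document root (the `FollowSymLinks` option) and that all targets are given as absolute paths. `DOC_ROOT`, `ARCHIVE_PATH`, `INDEX_PATH`, and `RECENT_PATH` are placeholders for your document root and the corresponding values in `collector.properties`.

```shell
# Sketch only: the function name and all variables are placeholders.
# Usage: link_into_docroot <document-root> <absolute-target>...
link_into_docroot() {
  doc_root=$1
  shift
  mkdir -p "$doc_root" || return 1
  for target in "$@"; do
    # -s: symbolic link, -f: replace an existing link, -n: do not follow it
    ln -sfn "$target" "$doc_root/$(basename "$target")" || return 1
  done
}

# Uncomment to run:
# link_into_docroot "$DOC_ROOT" "$ARCHIVE_PATH" "$INDEX_PATH" "$RECENT_PATH"
```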
## Maintaining the service
### Monitoring the service
The most important information about your CollecTor instance is whether it is
alive. If it dies and you don't notice, you might be losing data
that is not available at the data sources anymore. You should set up a
notification mechanism of your choice to be informed quickly when the background
updater dies.
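A very simple liveness check could look like the sketch below. It assumes that you wrote the process ID of the Java process to a file when starting it, for example with `echo $! > collector.pid`. The file name, the function name, and the notification command are placeholders.

```shell
# Sketch only: PID_FILE is hypothetical and must match what your start command wrote.
PID_FILE="${PID_FILE:-collector.pid}"

# Succeeds if the PID file is readable and a process with that ID exists.
# kill -0 sends no signal; it only checks whether the process can be signalled.
collector_alive() {
  [ -r "$1" ] && kill -0 "$(cat "$1")" 2>/dev/null
}

# Example for a periodic cron job, with a notification of your choice:
# collector_alive "$PID_FILE" || echo "CollecTor updater is not running" >&2
```

Note that a PID can be reused by an unrelated process after a reboot, so treat this as a first line of defense rather than a complete monitoring setup.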
Apart from fatal failures, the best way to learn about issues with your
CollecTor instance is to read its logs. Read them every now and then,
and look out for warnings and errors. Consider setting up another notification
to be informed quickly of new warnings or errors.
### Changing logging options
CollecTor uses Logback for logging and comes with a default logging
configuration that logs on info level and that creates a common log file that
rotates once per day and a separate log file per module. If you want to change
logging options, copy the default logging configuration from
`collector-<version>/src/main/resources/logback.xml` to your working directory,
edit your copy, and execute the .jar file as follows:
```java -Xmx2g -DLOGBASE=<your-log-dir> -cp .:collector-<version>.jar
org.torproject.collector.Main```
Internally, CollecTor uses the Simple Logging Facade for Java (SLF4J) and ships
with the Logback implementation for SLF4J. If you prefer a different logging
framework, you can provide and use that instead. For more detailed information,
or if you have different logging needs, please refer to the [Logback
documentation](http://logback.qos.ch/), and for switching to a different
framework to the [SLF4J website](http://www.slf4j.org/).
### Changing configuration options
If you need to reconfigure your CollecTor instance, you may be able to do that
without stopping and restarting the Java process. All general and module
settings may be changed at run-time: just edit the config file, and the changes
will become effective in the next execution of a module. Changes to scheduling
settings, however, require stopping and restarting the Java update process.
### Stopping the service (gracefully)
If you need to stop the background updater for some reason, like rebooting the
host, there is a way to do that gracefully: kill the Java process, and a
shutdown hook will stop the internal scheduler and wait for up to 10 minutes (or
whatever amount of time is configured) for all currently running updates to be
finished. However, if you must stop the process immediately, use `kill -9`,
though you might have to clean up manually. You should try to avoid rebooting
while tarballs are being created.
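The graceful variant can be sketched as follows: send the default termination signal, then wait for the process to exit. The function name and return codes are made up for this example, and the default timeout of 600 seconds simply mirrors the 10 minutes mentioned above.

```shell
# Sketch only: stop_gracefully is a hypothetical helper, not part of CollecTor.
# Usage: stop_gracefully <pid> [timeout-in-seconds]
# Returns 0 if the process exited, 1 if it was not running, 2 on timeout.
stop_gracefully() {
  pid=$1
  timeout=${2:-600}
  # Plain kill sends SIGTERM, which lets the shutdown hook run.
  kill "$pid" 2>/dev/null || return 1
  while [ "$timeout" -gt 0 ]; do
    kill -0 "$pid" 2>/dev/null || return 0   # process has exited
    sleep 1
    timeout=$((timeout - 1))
  done
  return 2   # still running; decide yourself whether kill -9 is warranted
}

# Uncomment to run with the PID of the Java process:
# stop_gracefully "$(cat collector.pid)"
```

The sketch deliberately never sends `kill -9` on its own, so that the decision to interrupt running updates stays with you.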
### Upgrading and downgrading
If you need to upgrade to a newer release or downgrade to a previous release,
download that tarball and extract it, and copy over the executable .jar file and
the `create-tarballs.sh` script in case it has changed. Stop the current
service version as described above, possibly adapt your `collector.properties`
file as necessary, and restart the Java process using the new .jar file. Don't
forget to update the version number in the command that ensures that the .jar
file gets executed automatically after a reboot. Watch the logs to see if the
upgrade or downgrade was successful.
### Backing up data and settings
A backup of your CollecTor instance should include the `<ArchivePath>` and your
configuration, which would enable you to set up this instance again. A backup
for short term recovery would also include the more volatile data in
`<StatsPath>`, `<RecentPath>`, and `<OutputPath>`.
### Performing recurring tasks
Most of CollecTor is designed to just run in the background forever. However,
some parts still require manual housekeeping every month or two: You'll need to
clean up data from `<OutputPath>` as configured in `collector.properties` when
you're certain that the data is contained in tarballs and contained in backups.
Likewise, you'll have to delete old files from `<BridgeLocalOrigins>`, in case
that is being used, where CollecTor only reads and never writes or deletes.
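To find candidates for cleanup, you could list files that have not been modified for a while, as in this sketch. The 60-day threshold and the function name are example choices, not CollecTor settings. The sketch only prints file names, so you can review the list and confirm that the data is in tarballs and backups before deleting anything.

```shell
# Sketch only: old_files is a hypothetical helper; 60 days is an example value.
# Usage: old_files <directory> [minimum-age-in-days]
old_files() {
  find "$1" -type f -mtime +"${2:-60}" -print
}

# Uncomment to run, with the directory configured as OutputPath or
# BridgeLocalOrigins in collector.properties:
# old_files /path/to/output 60
```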
### Resolving common issues
Unfortunately, CollecTor still runs into issues from time to time, and some of
these issues require a human being to decide whether they're harmless or require
intervention by the operator.
The most common issue these days is a warning about missing too many referenced
descriptors, which may even be true but which is typically not an operations
issue.
A lot less frequently, the bridgedesc module reports unrecognized lines in
non-sanitized bridge descriptors which, if true, requires developing and
deploying a patch. And sometimes the bridgedesc module complains about stale
input data, which requires fixing the bridge authority or the sync mechanism to
the CollecTor host.
Another minor issue is that files in `<OutputPath>` may change while tarballs
are being created, which is usually safe to ignore.
There's another frequent error message where CollecTor complains about not being
able to fetch a remote file during the sync process. This error message is
usually harmless and can be ignored.
But let's hope that you won't run into any of these issues or at least not
frequently. Enjoy your new CollecTor instance!
The default configuration can be found in the tar-ball you downloaded, in
the subdirectory collector-1.0.0/src/main/resources.


@ -1,62 +0,0 @@
CollecTor -- The friendly data-collecting service in the Tor network
====================================================================
CollecTor fetches data from various nodes and services in the public
Tor network and makes it available to the world.
Verifying releases
------------------
Releases can be cryptographically verified to get some more confidence that
they were put together by a Tor developer. The following steps explain the
verification process by example.
Download the release tarball and the separate signature file:
```
wget https://dist.torproject.org/collector/1.0.0/collector-1.0.0.tar.gz
wget https://dist.torproject.org/collector/1.0.0/collector-1.0.0.tar.gz.asc
```
Attempt to verify the signature on the tarball:
```
gpg --verify collector-1.0.0.tar.gz.asc
```
If the signature cannot be verified due to the public key of the signer
not being locally available, download that public key from one of the key
servers and retry:
```
gpg --keyserver pgp.mit.edu --recv-key 0x4EFD4FDC3F46D41E
gpg --verify collector-1.0.0.tar.gz.asc
```
If the signature still cannot be verified, something is wrong!
But note that even if it can be verified, you now only know that the
signature was made by the person claiming to own this key, which could be
anyone. You'll need a trust path to the owner of this key in order to
trust this signature, but that's clearly out of scope here. In short,
your best chance is to meet a Tor developer in real life and enter the web
of trust.
If you want to go one step further in the verification game, you can
verify the signature on the .jar files.
Print and then import the provided X.509 certificate:
```
keytool -printcert -file CERT
keytool -importcert -alias karsten -file CERT
```
Verify the signatures on the contained .jar files using Java's jarsigner
tool:
```
jarsigner -verify collector-1.0.0.jar
jarsigner -verify collector-1.0.0-sources.jar
```


@ -24,9 +24,7 @@ YEARTWO=`date --date='7 days ago' +%Y`
MONTHTWO=`date --date='7 days ago' +%m`
CURRENTPATH=`pwd`
if ! test -d $WORKDIR
then mkdir $WORKDIR
fi
mkdir -p $WORKDIR
cd $WORKDIR
@ -35,10 +33,7 @@ if ! test -d $OUTDIR
exit 1
fi
if ! test -d $TARBALLTARGETDIR
then echo "$TARBALLTARGETDIR doesn't exist. Exiting."
exit 1
fi
mkdir -p $TARBALLTARGETDIR
TARBALLS=(
exit-list-$YEARONE-$MONTHONE