# CollecTor Operator's Guide

Welcome to CollecTor, your friendly data-collecting service in the Tor network.
CollecTor fetches data from various nodes and services in the public Tor network
and makes it available to the world. This data includes relay descriptors from
the directory authorities, sanitized bridge descriptors from the bridge
authority, and other data about the Tor network.

This document describes how to set up your very own CollecTor instance. It was
written for readers who have at least some experience with running services and
are comfortable with the command line. You don't need to know how to read or
write Java code, though.

Before we go ahead with setting up your CollecTor instance, let us pause for a
moment and reflect on why you'd want to do that as opposed to simply using data
from an existing CollecTor instance.

CollecTor is a service, and the best reason for running a CollecTor service
instance is to offer your collected Tor network data to others. You could
mirror the data from an existing instance or even aggregate data from multiple
instances by using the synchronization feature. Or you could fetch data from
public sources and provide your data to users and other CollecTor instances.
Another reason might be to collect or synchronize Tor network data and provide
it to your working or research group. And of course you might want to run a
CollecTor instance for testing purposes. In all these cases, setting up a
CollecTor instance might make sense.

However, if you only want to use Tor network data as a client, even as input for
another service you're developing, you don't have to and probably shouldn't run
a CollecTor instance. In that case it's sufficient to use a library like
[metrics-lib](https://dist.torproject.org/descriptor/) or
[Stem](https://stem.torproject.org/) to fetch CollecTor data and process it.

## Setting up the host

You'll need a host with at least 200G of disk space and 4G of RAM.

In the following we'll assume that your host runs Debian stable as its operating
system. CollecTor should run on any other Linux or possibly even *BSD, though
you'll be mostly on your own with those. And as Java is available on a variety
of other operating systems, those might work, too, but, again, you'll be on your
own.

CollecTor does not require installing many or specific dependencies on the host
system. All it needs is a Java Runtime Environment version 7 or higher and
either Apache or nginx as HTTP server.

The CollecTor service runs entirely under a non-privileged user account. Any
user account will do, but feel free to create a new user account just for the
CollecTor service, if you prefer.

The CollecTor service must run in a working directory where it can store Tor
network data and state files. This working directory can be located anywhere
in the file system as long as there is enough disk space available. The Apache
or nginx service will later need to know where to find files to serve to web
clients, including other CollecTor instances.

CollecTor does not require setting up a database.

This concludes the host setup. Later in the process you'll once more need root
privileges to configure Apache or nginx to serve CollecTor files. But until
then you can do all setup steps with the non-privileged user account.

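The host requirements above can be checked with a few lines of shell. What
follows is a minimal sketch, not part of the CollecTor release: the function
names are made up for illustration, the RAM check assumes a Linux host with
`/proc/meminfo`, and the working directory is assumed to be passed as the first
argument (defaulting to the current directory).

```shell
#!/bin/sh
# Pre-flight check for a prospective CollecTor host (illustrative sketch).
# Thresholds follow this guide: 200G of disk space and 4G of RAM.

# Print the free space, in whole GB, of the file system holding directory $1.
free_disk_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Print the total RAM, rounded to whole GB (Linux only, reads /proc/meminfo).
total_ram_gb() {
  awk '/^MemTotal:/ { print int($2 / 1024 / 1024 + 0.5) }' /proc/meminfo
}

# report NAME VALUE MINIMUM: print OK or LOW for a measured value.
report() {
  if [ "$2" -ge "$3" ]; then
    echo "OK: $1 = ${2}G (minimum ${3}G)"
  else
    echo "LOW: $1 = ${2}G (minimum ${3}G)"
  fi
}

workdir="${1:-.}"   # assumption: working directory given as first argument
report disk "$(free_disk_gb "$workdir")" 200
if [ -r /proc/meminfo ]; then
  report ram "$(total_ram_gb)" 4
fi
if command -v java >/dev/null 2>&1; then
  # Prints the version banner; check by eye that it is version 7 or higher.
  java -version 2>&1 | head -n 1
else
  echo "LOW: no java executable found on PATH"
fi
```

A `LOW` line is only a hint. Whether the host is still usable depends on which
modules you plan to activate.
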
## Setting up the service

### Obtaining the code

CollecTor releases are available at:

```https://dist.torproject.org/collector/```

Choose the latest tarball and signature file, verify the signature on the
tarball, and extract the tarball in a location of your choice, which creates
a subdirectory called `collector-<version>/`.

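The verify-then-extract step might look like the sketch below. The function
name is hypothetical, and it assumes that the signature is a detached GnuPG
signature and that the signing key is already in your keyring. The exact file
names of the tarball and signature depend on the release you download.

```shell
# verify_and_extract TARBALL SIGNATURE DESTDIR
# Extracts TARBALL into DESTDIR only if the detached SIGNATURE verifies.
verify_and_extract() {
  tarball="$1"; signature="$2"; destdir="$3"
  if [ ! -f "$tarball" ] || [ ! -f "$signature" ]; then
    echo "missing tarball or signature file" >&2
    return 1
  fi
  # gpg exits with a non-zero status if the signature does not verify.
  gpg --verify "$signature" "$tarball" || return 1
  mkdir -p "$destdir" && tar -xf "$tarball" -C "$destdir"
}
```

Call it with the two downloaded files and the directory in which
`collector-<version>/` should be created.
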
### Planning the service setup

By default, CollecTor is configured to do nothing at all. The reason is that
new operators should first understand its capabilities and make a plan for
configuring their new CollecTor instance. Let's do that now.

CollecTor consists of a background updater with an internal scheduler and
several data-collecting modules that write data to local directories, which are
then served by a web server. Each of the modules can have one or more data
sources, some public like relay descriptors served by the directory authorities
and some private like bridge descriptors uploaded to the bridge directory
authority.

You'll have to decide which of the data-collecting modules you want to activate,
how often to execute these modules, and which data sources to collect data from.

The release tarball contains an executable .jar file:

```collector-<version>/generated/dist/collector-<version>.jar```

Copy this .jar file into the working directory and run it:

```java -jar collector-<version>.jar```

CollecTor will print some text about not being able to find a configuration
file, which is understandable since there is no such file yet. It also writes a
fresh configuration file called `collector.properties` to the working directory.
This file contains defaults that instruct CollecTor to do nothing.

Read through that file to learn about all available configuration options.

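While planning, it can help to see only the settings in effect, without the
surrounding comments. The sketch below relies on the standard Java properties
convention that comment lines start with `#` or `!`. The function name is made
up for illustration.

```shell
# list_settings FILE
# Print every line of a properties file that is neither blank nor a comment.
list_settings() {
  grep -Ev '^[[:space:]]*([#!]|$)' "$1"
}
```

Running `list_settings collector.properties` before and after editing gives a
compact view of what you changed.
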
### Performing the initial run

Once you have a plan for configuring your CollecTor instance, edit the
`collector.properties` file, set it to run only once, activate all relevant
modules, check and possibly edit other options as needed, and save the file.
Run the Java process using:

```java -Xmx2g -DLOGBASE=<your-log-dir> -jar collector-<version>.jar <your-collector.properties>```

The option `-Xmx2g` sets the maximum heap space to 2G, which is based on the
recommended 4G total RAM size for the host. If you have more memory to spare,
feel free to adapt this option as needed. Note that there is no option to limit
the amount of disk space used.

This may take a while, depending on which modules you activated. Read the logs
to learn if the run was successful. If it wasn't, go back to editing the
properties file and re-run the .jar file. Change the run-once option back when
you're done with the initial run of the Java process.

Complete the initialization step by copying the shell script
`collector-<version>/src/main/resources/create-tarballs.sh` from the release
tarball to the working directory or another location of your choice, editing the
contained paths, and executing it. Note that this script will at least partly
fail if one or more modules are deactivated, and that if you haven't edited any
paths, the script will write to `/srv/collector.torproject.org/collector/`.

### Scheduling periodic runs

The next step in setting up the CollecTor instance is to start the updater with
its internal scheduler and let it run continuously in the background. In order
to do so, make sure the run-once property is set to `false`, possibly adapt the
scheduling properties, and execute the .jar file using the same command as above
but this time in the background. Make sure that the same command will be run
automatically after a reboot.

Also make sure that the `create-tarballs.sh` script will be executed at least
every three days, but no more than once per day.

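One way to meet both scheduling requirements is the crontab of the CollecTor
user. The sketch below only prints two candidate crontab lines so that you can
review them before installing them with `crontab -e`. It assumes a cron
implementation that supports `@reboot`, as the one in Debian does. All paths
are placeholders, the function name is hypothetical, and the schedule of every
second day at 03:00 is just one choice within the window stated above.

```shell
# print_crontab WORKDIR JAR LOGDIR PROPERTIES
# Print crontab lines that start the updater after a reboot and create
# tarballs every second day of the month at 03:00.
print_crontab() {
  workdir="$1"; jar="$2"; logdir="$3"; props="$4"
  echo "@reboot cd $workdir && java -Xmx2g -DLOGBASE=$logdir -jar $jar $props"
  echo "0 3 */2 * * cd $workdir && ./create-tarballs.sh"
}
```

For example, `print_crontab /path/to/workdir collector-<version>.jar
/path/to/logs collector.properties` prints both lines with your values filled
in. Remember to update the .jar file name there whenever you change versions.
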
### Setting up the website

The last remaining part in the setup process is to make the collected data
available. Copy the contents from `collector-<version>/src/main/webapp/*` in
the release tarball to a web application subdirectory in the working directory
or another location of your choice.

Configure an Apache site that uses redirects or symbolic links to serve the
following directories or files in your working directory (where paths in `<>`
refer to settings in `collector.properties`):

* `<your-webapp-dir>/*`,
* `<ArchivePath>`,
* `<IndexPath>`, and
* `<RecentPath>`.

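The symbolic-link variant of the list above can be sketched as follows. This
is an illustration under assumptions: the function name and the link names
`archive`, `index`, and `recent` are made up here and are not prescribed by
CollecTor, and the Apache site itself still has to be configured to use the
document root and to follow symbolic links. Pass absolute paths so that the
links resolve from anywhere.

```shell
# link_served_paths DOCROOT WEBAPP_DIR ARCHIVE_PATH INDEX_PATH RECENT_PATH
# Populate DOCROOT with symbolic links to everything that should be served.
link_served_paths() {
  docroot="$1"; webapp="$2"; archive="$3"; index="$4"; recent="$5"
  mkdir -p "$docroot" || return 1
  # Link each file of the web application into the document root.
  for f in "$webapp"/*; do
    if [ -e "$f" ]; then
      ln -sfn "$f" "$docroot/$(basename "$f")"
    fi
  done
  # Link the data locations configured in collector.properties.
  ln -sfn "$archive" "$docroot/archive" &&
    ln -sfn "$index" "$docroot/index" &&
    ln -sfn "$recent" "$docroot/recent"
}
```
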
Alternatively, you can use nginx as your web server. With nginx you will need
the FancyIndex module to be able to include the provided footer and header of
the webapp. Copy `collector-<version>/src/main/resources/nginx-collector` to
`/etc/nginx/sites-available/` and make a symbolic link in
`/etc/nginx/sites-enabled/` to enable it.

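The copy-and-link step for nginx can be sketched like this. The function name
is hypothetical, and the optional second argument exists only so that the
function can be tried against a scratch directory first. On a real host it
defaults to `/etc/nginx` and needs root privileges.

```shell
# enable_nginx_site SITE_FILE [NGINX_DIR]
# Copy SITE_FILE to sites-available/ and link it from sites-enabled/.
enable_nginx_site() {
  site="$1"; nginx_dir="${2:-/etc/nginx}"
  name=$(basename "$site")
  cp "$site" "$nginx_dir/sites-available/$name" || return 1
  # A relative link target keeps the link valid if NGINX_DIR is moved.
  ln -sfn "../sites-available/$name" "$nginx_dir/sites-enabled/$name"
  # Afterwards, `nginx -t` checks the configuration syntax before you
  # reload nginx.
}
```
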
Now, use your browser to make sure that your instance serves the web pages and
data that you'd expect.

## Maintaining the service

### Monitoring the service

The most important information about your CollecTor instance is whether it is
alive. If it dies and you don't notice, you might lose data that is no longer
available at the data sources. You should set up a notification mechanism of
your choice to be informed quickly when the background updater dies.

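A very small liveness check could look like the sketch below. It assumes that
you record the updater's process ID in a file when you start it, for example
by appending `& echo $! > collector.pid` to the start command. CollecTor does
not write such a file itself, and both function names are made up for
illustration. Run from cron, any output is typically mailed to the crontab's
owner, which gives a basic notification.

```shell
# is_alive PIDFILE
# Succeed if PIDFILE holds the ID of a running process that we may signal.
is_alive() {
  [ -r "$1" ] || return 1
  pid=$(cat "$1")
  [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# check_updater PIDFILE
# Stay silent while the updater runs; print a message and fail otherwise.
check_updater() {
  if ! is_alive "$1"; then
    echo "CollecTor updater is not running (pid file: $1)"
    return 1
  fi
}
```
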
Other than fatal issues, a good source for learning about issues with your
CollecTor instance is its logs. Be sure to read the logs every now and then,
and look out for warnings and errors. Maybe set up another notification to be
informed quickly of new warnings or errors.

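Looking for warnings and errors can be partly automated. The sketch below
assumes that log lines contain the level names `WARN` and `ERROR` as separate
words, which is how Logback usually prints levels. Check this against your
actual log files, because the pattern configured in the shipped `logback.xml`
is not verified here. The function name is hypothetical.

```shell
# scan_logs LOGDIR
# Print all log lines mentioning WARN or ERROR, prefixed with the file name.
# Exits with status 1 if there are none, as grep does.
scan_logs() {
  grep -rHwE 'WARN|ERROR' "$1"
}
```

Pointing `scan_logs` at the directory given as `-DLOGBASE` and mailing its
output is one simple way to get the second notification mentioned above.
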
### Changing logging options

CollecTor uses Logback for logging and comes with a default logging
configuration that logs on info level and that creates a common log file that
rotates once per day and a separate log file per module. If you want to change
logging options, copy the default logging configuration from
`collector-<version>/src/main/resources/logback.xml` to your working directory,
edit your copy, and execute the .jar file as follows:

```java -Xmx2g -DLOGBASE=<your-log-dir> -cp .:collector-<version>.jar org.torproject.collector.Main```

Note that this command uses `-cp` instead of `-jar`, so that your copy of
`logback.xml` in the working directory comes first on the class path.

Internally, CollecTor uses the Simple Logging Facade for Java (SLF4J) and ships
with the Logback implementation for SLF4J. If you prefer a different logging
framework, you can provide and use that instead. For more detailed information,
or if you have different logging needs, please refer to the [Logback
documentation](http://logback.qos.ch/), and for switching to a different
framework to the [SLF4J website](http://www.slf4j.org/).

### Changing configuration options

If you need to reconfigure your CollecTor instance, you may be able to do that
without stopping and restarting the Java process. All general and module
settings may be changed at run-time: just edit the config file, and the changes
will become effective in the next execution of a module. Changes to the
scheduling settings, however, require stopping and restarting the Java update
process.

### Stopping the service (gracefully)

If you need to stop the background updater for some reason, like rebooting the
host, there is a way to do that gracefully: kill the Java process with a plain
`kill`, and a shutdown hook will stop the internal scheduler and wait for up to
10 minutes (or whatever amount of time is configured) for all currently running
updates to be finished. However, if you must stop the process immediately, use
`kill -9`, though you might have to clean up manually. You should try to avoid
rebooting while tarballs are being created.

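The graceful stop can be scripted as a plain `kill` followed by a bounded
wait. In the sketch below the function name is made up, and the default
timeout of 600 seconds mirrors the 10 minutes mentioned above. Adjust it if
you configured a different grace period. The function never escalates to
`kill -9` on its own.

```shell
# stop_gracefully PID [TIMEOUT_SECONDS]
# Send SIGTERM and wait until the process is gone.
# Returns 0 once it has exited, 1 on timeout, 2 if it could not be signalled.
stop_gracefully() {
  pid="$1"; timeout="${2:-600}"
  if ! kill "$pid" 2>/dev/null; then
    echo "cannot signal process $pid (not running or not permitted)" >&2
    return 2
  fi
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A return value of 1 means the updater is still busy. Whether to keep waiting
or to resort to `kill -9` remains your decision.
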
### Upgrading and downgrading

If you need to upgrade to a newer release or downgrade to a previous release,
download that tarball and extract it, and copy over the executable .jar file and
the `create-tarballs.sh` script in case it has changed. Stop the current
service version as described above, possibly adapt your `collector.properties`
file as necessary, and restart the Java process using the new .jar file. Don't
forget to update the version number in the command that ensures that the .jar
file gets executed automatically after a reboot. Watch the logs to see if the
upgrade or downgrade was successful.

### Backing up data and settings

A backup of your CollecTor instance should include the `<ArchivePath>` and your
configuration, which would enable you to set up this instance again. A backup
for short-term recovery would also include the more volatile data in
`<StatsPath>`, `<RecentPath>`, and `<OutputPath>`.

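A minimal backup could simply bundle the chosen paths into one tar file, as in
the sketch below. The function name is hypothetical, compression is left out
deliberately on the assumption that most of the archived data is compressed
already, and a real setup would more likely copy the result to another host.

```shell
# backup_instance DEST_TARBALL PATH...
# Bundle the given files and directories into DEST_TARBALL.
backup_instance() {
  dest="$1"; shift
  tar -cf "$dest" "$@"
}
```

For the long-term backup, pass the directory configured as `<ArchivePath>` and
your `collector.properties`. For short-term recovery, add the directories
configured as `<StatsPath>`, `<RecentPath>`, and `<OutputPath>`.
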
### Performing recurring tasks

Most of CollecTor is designed to just run in the background forever. However,
some parts still require manual housekeeping every month or two: You'll need to
clean up data from `<OutputPath>` as configured in `collector.properties` once
you're certain that the data is contained in both tarballs and backups.
Likewise, if `<BridgeLocalOrigins>` is being used, you'll have to delete old
files from it, because CollecTor only reads from that directory and never
writes or deletes there.

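Before deleting anything, it is safer to list the candidates first. The
sketch below prints files older than a given number of days without touching
them. The function name and the choice of age threshold are illustrative, and
what counts as safe to delete is for you to decide after checking tarballs and
backups.

```shell
# list_old_files DIR DAYS
# Print regular files under DIR that were last modified more than DAYS days
# ago.  Nothing is deleted.
list_old_files() {
  find "$1" -type f -mtime +"$2" -print
}
```

Once you are satisfied with the list, the same `find` expression with
`-delete` in place of `-print` removes the files.
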
### Resolving common issues

Unfortunately, CollecTor still runs into issues from time to time, and some of
these issues require a human being to decide whether they're harmless or require
intervention by the operator.

The most common issue these days is a warning about missing too many referenced
descriptors, which may even be true but which is typically not an operations
issue.

A lot less frequently, the bridgedesc module reports unrecognized lines in
non-sanitized bridge descriptors which, if true, requires developing and
deploying a patch. And sometimes the bridgedesc module complains about stale
input data, which requires fixing the bridge authority or the sync mechanism to
the CollecTor host.

Another minor issue is that files in `<OutputPath>` may change while tarballs
are being created, which is usually safe to ignore.

There's another frequent error message where CollecTor complains about not being
able to fetch a remote file during the sync process. This error message is
usually harmless and can be ignored.

But let's hope that you won't run into any of these issues or at least not
frequently. Enjoy your new CollecTor instance!