Expand INSTALL.md.

Contains many suggestions by iwakeh. Implements #20380.
2024-11-26 19:00:38 +00:00 · 2016-10-17 16:56:15 +02:00 · 2016-10-17 16:56:15 +02:00 · 0ac703be63
commit 0ac703be63
parent 3081583d9f
3 changed files with 253 additions and 129 deletions
--- a/INSTALL.md
+++ b/INSTALL.md
@@ -1,90 +1,281 @@
-CollecTor -- Operator's Guide
-=============================
+# CollecTor Operator's Guide

-Welcome to the Operator's Guide of CollecTor.  This guide explains how
-to set up a new CollecTor instance to download relay descriptors from the
-Tor directory authorities.
+Welcome to CollecTor, your friendly data-collecting service in the Tor network.
+CollecTor fetches data from various nodes and services in the public Tor network
+and makes it available to the world.  This data includes relay descriptors from
+the directory authorities, sanitized bridge descriptors from the bridge
+authority, and other data about the Tor network.
+
+This document describes how to set up your very own CollecTor instance.  It was
+written with an audience in mind that has at least some experience with running
+services and is comfortable with the command line.  It's not required that you
+know how to read or even write Java code, though.
+
+Before we go ahead with setting up your CollecTor instance, let us pause for a
+moment and reflect on why you'd want to do that as opposed to simply using data
+from an existing CollecTor instance.
+
+CollecTor is a service, and the best reason for running a CollecTor service
+instance is to offer your collected Tor network data to others.  You could
+mirror the data from an existing instance or even aggregate data from multiple
+instances by using the synchronization feature.  Or you could fetch data from
+public sources and provide your data to users and other CollecTor instances.
+Another reason might be to collect or synchronize Tor network data and provide
+it to your working or research group.  And of course you might want to run a
+CollecTor instance for testing purposes.  In all these cases, setting up a
+CollecTor instance might make sense.
+
+However, if you only want to use Tor network data as a client, even as input for
+another service you're developing, you don't have to and probably shouldn't run
+a CollecTor instance.  In that case it's sufficient to use a library like
+[metrics-lib](https://dist.torproject.org/descriptor/) or
+[Stem](https://stem.torproject.org/) to fetch CollecTor data and process it.


-Requirements
------------
+## Setting up the host

-You'll need a Linux host with at least 50G disk space and 2G RAM.
+You'll need a host with at least 200G disk space and 4G RAM.

-In the following we'll assume that the host runs Debian stable as
-operating system, but it should work on any other Linux or possibly even
-*BSD.  Though you'll be mostly on your own with those.
+In the following we'll assume that your host runs Debian stable as operating
+system.  CollecTor should run on any other Linux or possibly even *BSD, though
+you'll be mostly on your own with those.  And as Java is available on a variety
+of other operating systems, those might work, too, but, again, you'll be on your
+own.

-As Java is available on a variety of other operating systems, these might
-work, too.  But again you'll be on your own.
+CollecTor does not require installing many or specific dependencies on the host
+system.  All it needs are a Java Runtime Environment version 7 or higher and an
+Apache HTTP Server version 2 or higher.

-Prepare the system
------------------
+The CollecTor service runs entirely under a non-privileged user account.  Any
+user account will do, but feel free to create a new user account just for the
+CollecTor service, if you prefer.

-CollecTor is provided by The Tor Project and can be found here:
-    https://dist.torproject.org/collector/
-Download the tar.gz file with the version number listed in build.xml.
-The README inside the tar.gz file has all the information about CollecTor
-and explains how to verify the downloaded files.
+The CollecTor service requires running in a working directory where it can store
+Tor network data and state files.  This working directory can be located
+anywhere in the file system as long as there is enough disk space available.
+The Apache service will later need to know where to find files to serve to web
+clients including other CollecTor instances.

-You need a Java installation.  On Debian you can just run:
+CollecTor does not require setting up a database.

-$ sudo apt-get openjdk-7-jdk
+This concludes the host setup.  Later in the process you'll once more need root
+privileges to configure Apache to serve CollecTor files.  But until then you can
+do all setup steps with the non-privileged user account.

-Configure the relay descriptor downloader
-----------------------------------------

-Run
-$ java -DLOGBASE=/path/to/logs -jar collector-<version>.jar
-once in order to obtain a configuration properties file.
+## Setting up the service

-There are quite a few options to set in collector.properties and the comments
-explain their meaning.  So, you can set the options to the values you want.
+### Obtaining the code

-Create the paths you set in collector.properties.
+CollecTor releases are available at:

-Example: run the relay descriptor downloader
--------------------------------------------
+```https://dist.torproject.org/collector/```

-This is a small example about how CollecTor is used.  All the other
-settings are explained in the default collector.properties.
+Choose the latest tarball and signature file, verify the signature on the
+tarball, and extract the tarball in a location of your choice which will create
+a subdirectory called `collector-<version>/`.

-For running the relay descriptor downloader:

-Edit collector.properties and set at least the following value to true:
+### Planning the service setup

-DownloadRelayDescriptors = true
+By default, CollecTor is configured to do nothing at all.  The reason is that
+new operators should first understand its capabilities and make a plan for
+configuring their new CollecTor instance.  Let's do that now.

-$ java -DLOGBASE=/path/to/logs -jar collector-<version>.jar </place/of/collector.properties>
+CollecTor consists of a background updater with an internal scheduler and
+several data-collecting modules that write data to local directories which are
+then served by a webserver.  Each of the modules can have one or more data
+sources, some public like relay descriptors served by the directory authorities
+and some private like bridge descriptors uploaded to the bridge directory
+authority.

-Watch out for INFO-level logs in the log directory you configured.  In
-particular, the lines following "Statistics on the completeness of written
-relay descriptors:" are quite important.
+You'll have to decide which of the data-collecting modules you want to activate,
+how often to execute these modules, and which data sources to collect data from.

-In case of the unforeseen ERROR and WARN level logs should help you troubleshoot
-your installation.
+The release tarball contains an executable .jar file:

-Maintenance
-----------
+```collector-<version>/generated/dist/collector-<version>.jar```

-CollecTor is designed to keep running and attempts to re-run modules even
-when previous runs stopped because of a problem.  Thus, it is very important
-to watch out for WARNING level and especially ERROR level log statements.
+Copy this .jar file into the working directory and run it:

-These often will point to problems you can do something about, e.g. a full disk
-or missing file system permissions.
+```java -jar collector-<version>.jar```

-Logging Configuration
---------------------
+CollecTor will print some text about not being able to find a configuration
+file, which is understandable since there is no such file yet.  It also writes a
+fresh configuration file called `collector.properties` to the working directory
+which contains defaults (that instruct CollecTor to do nothing).

-Some hints for those who are familiar with Logback:
+Read through that file to learn about all available configuration options.

-If you want to use your own logging configuration for Logback you can simply
-create your own logback.xml or logback.groovy and start CollecTor in the
-following way:

-java -cp /folder/with/logback:collector-1.0.0.jar org.torproject.collector.Main
- </place/of/collector.properties>
+### Performing the initial run
+
+When you have made a plan for how to configure your CollecTor instance, edit
+the `collector.properties` file, set it to run only once, activate all relevant
+modules, check and possibly edit other options as needed, and save the file.
+Run the Java process using:
+
+```java -Xmx2g -DLOGBASE=<your-log-dir> -jar collector-<version>.jar
+<your-collector.properties>```
+
+The option `-Xmx2g` sets the maximum heap space to 2G, which is based on the
+recommended 4G total RAM size for the host.  If you have more memory to spare,
+feel free to adapt this option as needed.
+
+This may take a while, depending on which modules you activated.  Read the logs
+to learn if the run was successful.  If it wasn't, go back to editing the
+properties file and re-run the .jar file.  Change the run-once option back when
+you're done with the initial run of the Java process.
+
+Complete the initialization step by copying the shell script
+`collector-<version>/src/main/resources/create-tarballs.sh` from the release
+tarball to the working directory or another location of your choice, editing the
+contained paths, and executing it.  Note that this script will at least partly
+fail if one or more modules are deactivated.
+
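The run-once toggle described in this section can be flipped with a small helper instead of a manual edit.  The following is a minimal sketch, assuming the option is spelled `RunOnce` and that options use the `Key = value` form; check both against the generated `collector.properties`, since neither is confirmed here.  `set_property` is a hypothetical helper name.

```shell
#!/bin/bash
# set_property FILE KEY VALUE
# Hypothetical helper: rewrites an existing "KEY = ..." line in a properties
# file and keeps a backup copy in FILE.bak.  Comment lines and other keys are
# left untouched.  KEY and VALUE must not contain "|" or "&".
set_property() {
  local file=$1 key=$2 value=$3
  sed -i.bak -E "s|^${key}[[:space:]]*=.*$|${key} = ${value}|" "$file"
}

# Example (assumed option name):
#   set_property collector.properties RunOnce true    # before the initial run
#   set_property collector.properties RunOnce false   # afterwards
```

The helper only changes a key that is already present, so a misspelled option name leaves the file as it was rather than appending a stray line.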
+
+### Scheduling periodic runs
+
+The next step in setting up the CollecTor instance is to start the updater with
+its internal scheduler and let it run continuously in the background.  In order
+to do so, make sure the run-once property is set to `false`, possibly adapt the
+scheduling properties, and execute the .jar file using the same command as above
+but this time in the background.  Make sure that the same command will be run
+automatically after a reboot.
+
+Also make sure that the `create-tarballs.sh` script will be executed at least
+every three days, but no more than once per day.
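
One way to meet both requirements is a small start script that cron invokes at boot, plus a daily cron entry for the tarball script.  This is a sketch under assumptions, not the project's own tooling: the PID file, the `start_collector` helper, and all paths in the crontab comments are hypothetical, and `@reboot` relies on a cron implementation that supports it (Debian's does).

```shell
#!/bin/bash
# start_collector PIDFILE COMMAND [ARGS...]
# Hypothetical helper: starts COMMAND in the background unless the process
# recorded in PIDFILE is still alive, and records the new PID.
start_collector() {
  local pidfile=$1
  shift
  if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
    echo "already running with PID $(cat "$pidfile")"
    return 0
  fi
  nohup "$@" >/dev/null 2>&1 &
  echo "$!" > "$pidfile"
  echo "started with PID $!"
}

# Possible crontab entries for the CollecTor user (all paths are placeholders):
#   @reboot     cd /srv/collector && ./start-collector.sh
#   30 3 * * *  cd /srv/collector && ./create-tarballs.sh
```

A PID file can go stale after a crash, and the recorded PID may later be reused by an unrelated process, so treat "already running" as a hint to verify rather than as proof.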
+
+### Setting up the website
+
+The last remaining part in the setup process is to make the collected data
+available.  Copy the contents from `collector-<version>/src/main/webapp/*` in
+the release tarball to a web application subdirectory in the working directory
+or another location of your choice.
+
+Configure an Apache site that uses redirects or symbolic links to serve the
+following directories or files in your working directory (where paths in <>
+refer to settings in `collector.properties`):
+
+ * `<your-webapp-dir>/*`,
+ * `<ArchivePath>`,
+ * `<IndexPath>`, and
+ * `<RecentPath>`.
+
+Use your browser to make sure that your instance serves the web pages and data
+that you'd expect.
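
For the symbolic-link variant, the links can be created with a short helper.  A minimal sketch, assuming the web application directory doubles as Apache's document root and that the configured paths are linked in under their base names; `link_into_docroot` and the example paths are hypothetical.

```shell
#!/bin/bash
# link_into_docroot DOCROOT TARGET...
# Hypothetical helper: creates or refreshes one symbolic link per TARGET inside
# DOCROOT, named after the last path component of TARGET.
link_into_docroot() {
  local docroot=$1 target
  shift
  mkdir -p "$docroot"
  for target in "$@"; do
    ln -sfn "$target" "$docroot/$(basename "$target")"
  done
}

# Example with placeholder paths; substitute your <ArchivePath>, <IndexPath>,
# and <RecentPath> values:
#   link_into_docroot /srv/collector/webapp \
#     /srv/collector/archive /srv/collector/index /srv/collector/recent
```

Apache only follows symbolic links in directories where `Options FollowSymLinks` (or `SymLinksIfOwnerMatch`) is enabled, so the site configuration has to allow that for the document root.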
+
+
+## Maintaining the service
+
+### Monitoring the service
+
+The most important information about your CollecTor instance is whether it is
+alive.  Otherwise, if it dies and you don't notice, you might be losing data
+that is not available at the data sources anymore.  You should set up a
+notification mechanism of your choice to be informed quickly when the background
+updater dies.
+
+Beyond fatal failures, the logs are a good source for learning about issues
+with your CollecTor instance.  Be sure to read them every now and then, and
+look out for warnings and errors.  Maybe set up another notification to be
+informed quickly of new warnings or errors.
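
Both checks, liveness and log contents, can be scripted and run from cron.  The sketch below assumes that the updater's PID was written to a file at start-up and that log lines contain the level as a separate word (`WARN`, `ERROR`); the helper names and paths are hypothetical.

```shell
#!/bin/bash
# is_alive PIDFILE
# Succeeds if PIDFILE exists and names a process that can be signalled.
is_alive() {
  [ -f "$1" ] && kill -0 "$(cat "$1")" 2>/dev/null
}

# count_log_problems LOGFILE
# Prints the number of lines containing WARN or ERROR as a whole word.
count_log_problems() {
  grep -c -w -E 'WARN|ERROR' "$1"
}

# Example with placeholder paths; replace echo with your notification method:
#   is_alive /srv/collector/collector.pid || echo "CollecTor updater is down"
#   [ "$(count_log_problems /srv/collector/logs/collector.log)" -eq 0 ] \
#     || echo "warnings or errors in the CollecTor log"
```

`kill -0` sends no signal and only tests whether one could be sent, so the check has to run as the CollecTor user or as root.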
+
+
+### Changing logging options
+
+CollecTor uses Logback for logging and comes with a default logging
+configuration that logs on info level and that creates a common log file that
+rotates once per day and a separate log file per module.  If you want to change
+logging options, copy the default logging configuration from
+`collector-<version>/src/main/resources/logback.xml` to your working directory,
+edit your copy, and execute the .jar file as follows:
+
+```java -Xmx2g -DLOGBASE=<your-log-dir> -cp .:collector-<version>.jar
+org.torproject.collector.Main```
+
+Internally, CollecTor uses the Simple Logging Facade for Java (SLF4J) and ships
+with the Logback implementation for SLF4J.  If you prefer a different logging
+framework, you can provide and use that instead.  For more detailed information,
+or if you have different logging needs, please refer to the [Logback
+documentation](http://logback.qos.ch/), and for switching to a different
+framework to the [SLF4J website](http://www.slf4j.org/).
+
+
+### Changing configuration options
+
+If you need to reconfigure your CollecTor instance, you may be able to do that
+without stopping and restarting the Java process.  All general and module
+settings may be changed at run-time: just edit the config file, and the changes
+will become effective in the next execution of a module.  Scheduling settings
+are the exception; changing them requires stopping and restarting the Java
+update process.
+
+
+### Stopping the service (gracefully)
+
+If you need to stop the background updater for some reason, like rebooting the
+host, there is a way to do that gracefully: send the Java process a plain `kill`
+(SIGTERM), and a shutdown hook will stop the internal scheduler and wait for up
+to 10 minutes (or whatever amount of time is configured) for all currently
+running updates to finish.  However, if you must stop the process immediately,
+use `kill -9`, though you might have to clean up manually.  You should try to
+avoid rebooting while tarballs are being created.
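
The graceful-then-forced sequence can be wrapped in a helper that waits for the shutdown hook before giving up.  A minimal sketch, assuming you know the updater's PID; `stop_gracefully` and the PID file in the example are hypothetical, and the default timeout of 660 seconds is simply the 10-minute grace period plus a margin.

```shell
#!/bin/bash
# stop_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM, then polls once per second until the process is gone.
# Returns 0 once the process has exited (or could not be signalled at all),
# and 1 if it is still running when the timeout expires.
stop_gracefully() {
  local pid=$1 timeout=${2:-660} waited=0
  kill "$pid" 2>/dev/null || return 0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "PID $pid still running after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Example: fall back to an immediate stop only if the grace period runs out.
#   stop_gracefully "$(cat collector.pid)" || kill -9 "$(cat collector.pid)"
```

Leaving the `kill -9` fallback to the caller keeps the forced stop a deliberate choice rather than something that happens automatically.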
+
+
+### Upgrading and downgrading
+
+If you need to upgrade to a newer release or downgrade to a previous release,
+download that tarball and extract it, and copy over the executable .jar file and
+the `create-tarballs.sh` script in case it has changed.  Stop the current
+service version as described above, possibly adapt your `collector.properties`
+file as necessary, and restart the Java process using the new .jar file.  Don't
+forget to update the version number in the command that ensures that the .jar
+file gets executed automatically after a reboot.  Watch the logs to see if the
+upgrade or downgrade was successful.
+
+
+### Backing up data and settings
+
+A backup of your CollecTor instance should include the `<ArchivePath>` and your
+configuration, which would enable you to set up this instance again.  A backup
+for short-term recovery would also include the more volatile data in
+`<StatsPath>`, `<RecentPath>`, and `<OutputPath>`.
+
+
+### Performing recurring tasks
+
+Most of CollecTor is designed to just run in the background forever.  However,
+some parts still require manual housekeeping every month or two: You'll need to
+clean up data from `<OutputPath>` as configured in `collector.properties` when
+you're certain that the data is contained in tarballs and contained in backups.
+Likewise, you'll have to delete old files from `<BridgeLocalOrigins>`, in case
+that is being used, where CollecTor only reads and never writes or deletes.
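
Finding candidates for this clean-up is easy to script, as long as deletion stays a deliberate second step.  A sketch, assuming that file modification time is a good enough proxy for age and that you have already confirmed the data is contained in tarballs and backups; the helper name, the paths, and the 60-day threshold are hypothetical.

```shell
#!/bin/bash
# list_old_files DIR DAYS
# Prints regular files below DIR that were last modified more than DAYS days ago.
list_old_files() {
  find "$1" -type f -mtime +"$2" -print
}

# Example with a placeholder path: review the list first, then delete.
#   list_old_files /srv/collector/out 60
#   find /srv/collector/out -type f -mtime +60 -delete
```

Running the listing step first makes it possible to compare the candidates against the tarball contents before anything is removed.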
+
+
+### Resolving common issues
+
+Unfortunately, CollecTor still runs into issues from time to time, and some of
+these issues require a human being to decide whether they're harmless or require
+intervention by the operator.
+
+The most common issue these days is a warning about missing too many referenced
+descriptors, which may even be true but which is typically not an operations
+issue.
+
+A lot less frequently, the bridgedesc module reports unrecognized lines in
+non-sanitized bridge descriptors which, if true, requires developing and
+deploying a patch.  And sometimes the bridgedesc module complains about stale
+input data, which requires fixing the bridge authority or the sync mechanism to
+the CollecTor host.
+
+Another minor issue is that files in `<OutputPath>` may change while tarballs
+are being created, which is usually safe to ignore.
+
+There's another frequent error message where CollecTor complains about not being
+able to fetch a remote file during the sync process.  This error message is
+usually harmless and can be ignored.
+
+But let's hope that you won't run into any of these issues or at least not
+frequently.  Enjoy your new CollecTor instance!

-The default configuration can be found in the tar-ball you downloaded, in
-the subdirectory collector-1.0.0/src/main/resources.
--- a/README.md
+++ b/README.md
@@ -1,62 +0,0 @@
-CollecTor -- The friendly data-collecting service in the Tor network
-====================================================================
-
-CollecTor fetches data from various nodes and services in the public
-Tor network and makes it available to the world.
-
-Verifying releases
------------------
-
-Releases can be cryptographically verified to get some more confidence that
-they were put together by a Tor developer.  The following steps explain the
-verification process by example.
-
-Download the release tarball and the separate signature file:
-
-```
-wget https://dist.torproject.org/collector/1.0.0/collector-1.0.0.tar.gz
-wget https://dist.torproject.org/collector/1.0.0/collector-1.0.0.tar.gz.asc
-```
-
-Attempt to verify the signature on the tarball:
-
-```
-gpg --verify collector-1.0.0.tar.gz.asc
-```
-
-If the signature cannot be verified due to the public key of the signer
-not being locally available, download that public key from one of the key
-servers and retry:
-
-```
-gpg --keyserver pgp.mit.edu --recv-key 0x4EFD4FDC3F46D41E
-gpg --verify collector-1.0.0.tar.gz.asc
-```
-
-If the signature still cannot be verified, something is wrong!
-
-But note that even if it can be verified, you now only know that the
-signature was made by the person claiming to own this key, which could be
-anyone.  You'll need a trust path to the owner of this key in order to
-trust this signature, but that's clearly out of scope here.  In short,
-your best chance is to meet a Tor developer in real life and enter the web
-of trust.
-
-If you want to go one step further in the verification game, you can
-verify the signature on the .jar files.
-
-Print and then import the provided X.509 certificate:
-
-```
-keytool -printcert -file CERT
-keytool -importcert -alias karsten -file CERT
-```
-
-Verify the signatures on the contained .jar files using Java's jarsigner
-tool:
-
-```
-jarsigner -verify collector-1.0.0.jar
-jarsigner -verify collector-1.0.0-sources.jar
-```
-
--- a/src/main/resources/create-tarballs.sh
+++ b/src/main/resources/create-tarballs.sh
@@ -24,9 +24,7 @@ YEARTWO=`date --date='7 days ago' +%Y`
 MONTHTWO=`date --date='7 days ago' +%m`
 CURRENTPATH=`pwd`

-if ! test -d $WORKDIR
-  then mkdir $WORKDIR
-fi
+mkdir -p $WORKDIR

 cd $WORKDIR

@@ -35,10 +33,7 @@ if ! test -d $OUTDIR
  exit 1
 fi

-if ! test -d $TARBALLTARGETDIR
-  then echo "$TARBALLTARGETDIR doesn't exist.  Exiting."
-  exit 1
-fi
+mkdir -p $TARBALLTARGETDIR

 TARBALLS=(
  exit-list-$YEARONE-$MONTHONE