508 Commits

Author SHA1 Message Date
Karsten Loesing
7b026cf44a Allow underscore in transport names.
Example of a valid line that is now allowed:

bridge-ip-transports meek=32,obfs3_websocket=8,websocket=64
2014-07-22 09:21:00 +02:00
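A minimal sketch of the kind of check 7b026cf44a describes, assuming transport names are validated with a regular expression. The class name, method name, and pattern are hypothetical and not taken from the library's code.

```java
import java.util.regex.Pattern;

final class TransportLineCheck {

  // Hypothetical pattern: comma-separated name=count pairs whose names
  // may contain letters, digits, and underscores.
  private static final Pattern TRANSPORTS = Pattern.compile(
      "[A-Za-z0-9_]+=[0-9]+(,[A-Za-z0-9_]+=[0-9]+)*");

  private static final String KEYWORD = "bridge-ip-transports ";

  // Returns true if the line starts with the keyword and the rest of the
  // line matches the list pattern above.
  static boolean isValid(String line) {
    return line.startsWith(KEYWORD)
        && TRANSPORTS.matcher(line.substring(KEYWORD.length())).matches();
  }
}
```

With this pattern the example line from the commit message is accepted, while a name containing a hyphen would still be rejected.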
Karsten Loesing
73e5a6989d Avoid parsing descriptor contents to Maps.
Extra-info descriptors contain lots of comma-separated key=value lists
that we store in SortedMap instances.  But those occupy a lot of memory,
and it's not certain that we'll ever want to use the contained keys or
values.

New approach: when parsing a descriptor, use regular expressions to check
if lines are valid, and delay parsing into maps until needed.
2014-06-18 15:12:46 +02:00
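The approach in 73e5a6989d (validate with a regular expression at parse time, build the map only on request) could look roughly like the sketch below. The class name, the pattern, and the restriction to non-negative values of at most 18 digits are assumptions made for illustration.

```java
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

final class LazyKeyValueList {

  // Hypothetical validity check for a comma-separated key=value list.
  private static final Pattern VALID = Pattern.compile(
      "[^=,]+=[0-9]{1,18}(,[^=,]+=[0-9]{1,18})*");

  // Only the raw string is kept after parsing the descriptor.
  private final String raw;

  LazyKeyValueList(String raw) {
    if (!VALID.matcher(raw).matches()) {
      throw new IllegalArgumentException("Illegal list: " + raw);
    }
    this.raw = raw;
  }

  // Builds the map when a caller actually asks for it.
  SortedMap<String, Long> toMap() {
    SortedMap<String, Long> result = new TreeMap<>();
    for (String pair : raw.split(",")) {
      String[] parts = pair.split("=");
      result.put(parts[0], Long.parseLong(parts[1]));
    }
    return result;
  }
}
```

Callers that never request the map pay only for the raw string. Callers that request it repeatedly rebuild it each time in this sketch.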
Karsten Loesing
be359c873a Store relay flags more efficiently. 2014-06-18 15:12:46 +02:00
Karsten Loesing
f3a170fb74 Clear sets used to validate at-most-once/exactly-once keywords.
Related to 5caa384.  Similarly, keeping these sets around just wastes heap
space.
2014-06-18 15:12:46 +02:00
Karsten Loesing
c439d346b9 Avoid parsing descriptor contents to Lists or Sets.
If we can easily determine the number of List or Set elements, we can as
well store their contents in arrays and convert those to List or Set
instances when requested.  This can save us some memory and doesn't cost
much performance.
2014-06-18 15:12:41 +02:00
Karsten Loesing
557c2ccfd7 Always accept [SP|TAB]+ as delimiter instead of just SP.
Better fix for #12403.
2014-06-17 13:57:15 +02:00
Karsten Loesing
b1478b8fb5 Add unit tests for 2351cea. 2014-06-17 13:44:55 +02:00
Karsten Loesing
2351cead7a Accept [SP|TAB]+ as delimiter in two places.
There have been at least two relays including an additional SP after their
nickname in server and extra-info descriptors.  The spec stays vague about
whether this is allowed or not, but the directory authorities seem to
accept these descriptors just fine.  We should also accept these
descriptors.

We should probably accept [SP|TAB]+ in more places.  But right now we're
losing data by discarding these descriptors.  Let's do the quick fix now
and the potentially cleaner fix later.

Fixes #12403.
2014-06-16 07:32:55 +02:00
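To illustrate the difference that 2351cead7a is about, here is a small sketch comparing a split on a single SP with a split on [SP|TAB]+. The class and method names are hypothetical.

```java
final class LineSplitter {

  // Splits on exactly one space, so a doubled space yields an empty part.
  static String[] splitOnSingleSpace(String line) {
    return line.split(" ");
  }

  // Splits on one or more spaces or tabs, so extra delimiters are absorbed.
  static String[] splitOnSpacesOrTabs(String line) {
    return line.split("[ \t]+");
  }
}
```

For a line such as "extra-info nickname  ABCD" (two spaces after the nickname), the first method returns four parts including an empty one, while the second returns the three expected parts.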
Karsten Loesing
a472d7f342 Store relay flags more efficiently.
Turns out a TreeSet<String> requires more memory than a String[].  We can
put together the TreeSet<String> when we need it.
2014-05-28 11:01:47 +02:00
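A sketch of the idea in a472d7f342: keep the flags in a String[] and put the TreeSet together only when it is requested. The class name and the "s" line handling are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

final class FlagHolder {

  // Compact storage: one array instead of a tree of nodes.
  private final String[] flags;

  // Expects a status entry line such as "s Fast Running Valid".
  FlagHolder(String sLine) {
    String[] parts = sLine.split("[ \t]+");
    this.flags = Arrays.copyOfRange(parts, 1, parts.length);
  }

  // Builds a fresh sorted set on each request.
  SortedSet<String> getFlags() {
    return new TreeSet<>(Arrays.asList(flags));
  }
}
```

The trade-off is a small amount of CPU time on each call to getFlags() in exchange for less heap used per stored entry.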
Karsten Loesing
e3945981d1 Use a single DateFormat per thread and format.
DateFormat is not thread-safe, and creating a new instance every time we
need one only wastes CPU time.  Make sure that there's a single instance
per thread and format that the thread can use whenever it wants.
2014-05-28 11:01:47 +02:00
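One way to get "a single instance per thread and format", as e3945981d1 puts it, is a ThreadLocal holding a map from pattern to DateFormat. This is a sketch under that assumption. The class name and the choice of UTC and non-lenient parsing are illustrative, not confirmed by the commit.

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;
import java.util.TimeZone;

final class DateFormats {

  // Each thread gets its own map, so no DateFormat is shared across threads.
  private static final ThreadLocal<Map<String, DateFormat>> FORMATS =
      ThreadLocal.withInitial(HashMap::new);

  // Returns this thread's instance for the pattern, creating it on first use.
  static DateFormat get(String pattern) {
    Map<String, DateFormat> perThread = FORMATS.get();
    DateFormat format = perThread.get(pattern);
    if (format == null) {
      format = new SimpleDateFormat(pattern);
      format.setLenient(false);
      format.setTimeZone(TimeZone.getTimeZone("UTC"));
      perThread.put(pattern, format);
    }
    return format;
  }
}
```

Repeated calls with the same pattern on the same thread return the same object, so the cost of constructing a SimpleDateFormat is paid once per thread and pattern.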
Karsten Loesing
5caa3848b0 Clear parsed keywords after verifying them.
No need to keep them around.  That's just a waste of heap space.
2014-05-28 11:01:22 +02:00
Karsten Loesing
a12e989e40 Store bandwidth histories more efficiently.
We were storing bandwidth histories in TreeMap<Long, Long>() with keys
being time in millis and values being bandwidth values.  This showed up in
profiles.  It's far more (memory-)efficient to store bandwidth values in a
long[] and put together the TreeMap when the caller requests it.  And if
the bandwidth history is evaluated exactly once, there should not even be
a CPU overhead.
2014-05-27 21:09:48 +02:00
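The storage change in a12e989e40 might be sketched as follows. The commit only says that values are kept in a long[]. The extra fields for the end of the last interval and the interval length, and all names, are assumptions of this sketch.

```java
import java.util.SortedMap;
import java.util.TreeMap;

final class BandwidthHistorySketch {

  private final long lastIntervalEndMillis;  // assumed: end of newest interval
  private final long intervalLengthMillis;   // assumed: fixed interval length
  private final long[] values;               // oldest value first

  BandwidthHistorySketch(long lastIntervalEndMillis,
      long intervalLengthMillis, long[] values) {
    this.lastIntervalEndMillis = lastIntervalEndMillis;
    this.intervalLengthMillis = intervalLengthMillis;
    this.values = values.clone();
  }

  // Puts together the map of interval end (millis) to bandwidth value
  // only when the caller requests it.
  SortedMap<Long, Long> getBandwidthValues() {
    SortedMap<Long, Long> result = new TreeMap<>();
    long intervalEnd = lastIntervalEndMillis
        - (values.length - 1) * intervalLengthMillis;
    for (long value : values) {
      result.put(intervalEnd, value);
      intervalEnd += intervalLengthMillis;
    }
    return result;
  }
}
```

If the history is evaluated exactly once, the map is built exactly once, which matches the commit's remark that there should be no CPU overhead in that case.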
Karsten Loesing
8722de7044 Make queue size of descriptor reader configurable.
By default, the descriptor reader puts up to 100 parsed descriptor files
in a queue in order to hand them out as quickly as possible.  But if
descriptor files contain hundreds or even thousands of descriptors, that
default may be too high.  Add a new method to make it configurable.
2014-05-27 20:53:04 +02:00
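A bounded queue with a configurable capacity, as described in 8722de7044, can be sketched with java.util.concurrent.ArrayBlockingQueue. The class and method names below are hypothetical, and only the default of 100 comes from the commit message.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class QueueingReaderSketch {

  // Default taken from the commit message.
  private int maxFilesInQueue = 100;

  // Hypothetical setter; must be called before the queue is created.
  void setMaxFilesInQueue(int max) {
    if (max < 1) {
      throw new IllegalArgumentException("Queue size must be positive.");
    }
    this.maxFilesInQueue = max;
  }

  // Creates the bounded queue; String stands in for a parsed file.
  BlockingQueue<String> newQueue() {
    return new ArrayBlockingQueue<>(maxFilesInQueue);
  }
}
```

A reader thread using put() on such a queue blocks once the limit is reached, which caps how many parsed files are held in memory at a time.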
Karsten Loesing
b298cbcbd1 Fix encoding problem when parsing multi-descriptor files.
When we're parsing a descriptor file with potentially more than one
descriptor in it, we're converting file contents to String to be able to
search for descriptor beginnings using String methods.  But we're not
passing a character encoding, leaving it up to Java to guess.  What we
should do is tell it to use "US-ASCII" as encoding, which is sufficient to
find keywords marking the beginning of a new descriptor.

Fixes #11821.
2014-05-25 12:32:36 +02:00
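The fix in b298cbcbd1 comes down to passing an explicit charset when converting bytes to a String. The sketch below shows one way to search for descriptor beginnings that way. The class and method names and the keyword handling are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class DescriptorStartFinder {

  // Returns the byte offsets at which a line starts with the given keyword.
  static List<Integer> findStarts(byte[] rawBytes, String keyword) {
    // Decoding as US-ASCII maps every byte to exactly one char (bytes
    // outside ASCII become a replacement char), so String indexes equal
    // byte offsets regardless of the platform default charset.
    String ascii = new String(rawBytes, StandardCharsets.US_ASCII);
    List<Integer> starts = new ArrayList<>();
    if (ascii.startsWith(keyword)) {
      starts.add(0);
    }
    String needle = "\n" + keyword;
    int from = ascii.indexOf(needle);
    while (from >= 0) {
      starts.add(from + 1);
      from = ascii.indexOf(needle, from + 1);
    }
    return starts;
  }
}
```

The resulting offsets can be used to cut the original byte array into single descriptors without ever relying on the decoded String for content.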
Karsten Loesing
38c48ddd0c Parse microdesc consensuses and microdescriptors.
Required for implementing #2785.
2014-01-17 15:53:45 +01:00
Jens-Michael Hoffmann
3e60ccdaab Fix build errors on Debian systems.
The local lib directory is not used anymore and respective references
were removed. The java dependencies are now specified in the build.xml
and taken from their installed locations.

In addition to git, openjdk-6-jdk and ant the following java packages
have to be installed:
 - libcommons-codec-java
 - libcommons-compress-java
 - junit4

Minor tweaks by Karsten Loesing.
2013-07-31 16:40:06 +02:00
Karsten Loesing
008781b7e5 Add tests for published lines containing milliseconds.
Milliseconds are simply ignored, because SimpleDateFormat only looks at
"yyyy-MM-dd HH:mm:ss" and ignores everything after that.

Related to #9286 where we discovered that some relays include milliseconds
in their descriptors.
2013-07-18 14:21:36 +02:00
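The behavior noted in 008781b7e5 follows from DateFormat.parse(String), which may stop before the end of the input. A short sketch, with a hypothetical class name and an assumed UTC time zone:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

final class PublishedParser {

  // Parses the timestamp of a published line; anything after the seconds,
  // such as ".123", is left unread by SimpleDateFormat.parse(String).
  static long parsePublishedMillis(String value) {
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    format.setLenient(false);
    format.setTimeZone(TimeZone.getTimeZone("UTC"));
    try {
      return format.parse(value).getTime();
    } catch (ParseException e) {
      throw new IllegalArgumentException("Illegal timestamp: " + value, e);
    }
  }
}
```

Both "2013-07-18 14:21:36" and "2013-07-18 14:21:36.123" yield the same value here, because the fractional part is never consumed.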
Karsten Loesing
60a066a0b0 Fast exits read/write more than MAX_INT KiB per day.
For example, see "other" entry in:

  exit-kibibytes-read 80=505190490,182=25102395,443=61873906,
    6881=47999666,8989=8657674,17173=7910494,21762=9138992,
    45682=5154543,50500=6086469,51413=62394452,other=2282907805
2013-07-08 12:53:53 +02:00
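The "other" value in 60a066a0b0, 2282907805, is larger than Integer.MAX_VALUE (2147483647), so the values need to be parsed as long. A sketch with hypothetical names, using a shortened version of the list above:

```java
import java.util.SortedMap;
import java.util.TreeMap;

final class PortStatsParser {

  // Parses "port=KiB,...,other=KiB" into a map, using long for the values
  // because fast exits can exceed the int range.
  static SortedMap<String, Long> parse(String list) {
    SortedMap<String, Long> result = new TreeMap<>();
    for (String pair : list.split(",")) {
      String[] parts = pair.split("=");
      if (parts.length != 2) {
        throw new IllegalArgumentException("Illegal entry: " + pair);
      }
      result.put(parts[0], Long.parseLong(parts[1]));
    }
    return result;
  }
}
```

Calling Integer.parseInt("2282907805") throws a NumberFormatException, whereas Long.parseLong handles the value without trouble.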
Karsten Loesing
e7f93e1a6a Restrict valid keyword characters to [A-Za-z0-9-]+.
Fixes #8798.
2013-05-03 15:33:29 +02:00
Karsten Loesing
b58211e577 Support bridge-ip-transports lines in extra-infos. 2013-04-19 20:57:31 +02:00
Karsten Loesing
5b21044819 Parse Unmeasured=1 in w lines of consensuses.
Pointed out by atagar.
2013-04-09 08:36:53 +02:00
Karsten Loesing
fdcf0b49a3 guard-tk actually stands for *weighted* time known. 2013-02-05 15:37:12 +01:00
Karsten Loesing
c2a0dbf8bf Parse the new flag-thresholds line in votes. 2013-02-05 14:49:28 +01:00
Karsten Loesing
785fd43246 Add parsing support for ntor-onion-key line.
Spotted by atagar; see #7867.
2013-01-07 05:24:11 +01:00
Karsten Loesing
895992549b Parse ipv6-policy lines in server descriptors.
Spotted by atagar in related ticket #7826.
2012-12-30 19:54:06 +01:00
Karsten Loesing
17e9149f07 Add support for parsing bridge-ip-versions lines. 2012-11-08 14:18:18 -05:00
Karsten Loesing
43b9390250 Add support for parsing geoip6-db-digest lines. 2012-11-07 13:55:23 -05:00
Karsten Loesing
be27fef42e Looks like $fingerprint~nickname is also a valid family line entry.
Support for $fingerprint=nickname was previously added in 6a46f46.
2012-11-07 12:27:54 -05:00
Karsten Loesing
66cec8b01f Allow multiple "m" lines per network status entry. 2012-09-05 04:16:41 -04:00
Karsten Loesing
ba8cb725d2 Remove GetTor statistics parsing code. 2012-08-07 12:25:42 +02:00
Karsten Loesing
25f0e656c4 Accept transport lines containing more than just the transport name.
Sanitized bridge descriptors contain transport lines with just the
transport name.  However, there are now relays including unsanitized
transport lines, most likely because of a configuration problem.  Don't
reject the entire descriptor when encountering those lines.
2012-08-06 08:08:44 +02:00
Karsten Loesing
20f9d5574f Make parse history in descriptor reader more accessible.
So far, the only way to prevent files from being parsed repeatedly in
distinct runs was to specify a history file that only metrics-lib was
supposed to read and write.  However, some applications may want to
specify the list of files to exclude themselves, or they may want to learn
which files have been excluded and which have been parsed.  These
applications shouldn't be forced to mess with the history file.

Add three methods to the descriptor reader for these applications.  They
should also play nicely together with the history file approach.

AFAIK, stem has methods with the same purpose but slightly different
semantics.
2012-07-21 12:11:47 +02:00
Karsten Loesing
ca201de75e Parse transport lines in bridge extra-infos. 2012-06-29 13:54:31 +02:00
Karsten Loesing
0c19088c4b We can parse all @type 1.x descriptor versions. 2012-06-29 13:29:05 +02:00
Karsten Loesing
2c3e59bb71 Tweak build file a bit. 2012-06-19 14:36:44 +02:00
Karsten Loesing
7348b3d208 Fix unit tests which were broken in 466725e. 2012-06-19 14:17:32 +02:00
Karsten Loesing
a3d89ee788 Support parsing GetTor statistics files. 2012-06-01 11:48:51 +02:00
Karsten Loesing
194768b33f Parse exit lists with @type annotation and Downloaded line. 2012-05-31 16:00:09 +02:00
Karsten Loesing
71f473962f Understand @type annotation in bridge pool assignments. 2012-05-31 12:02:48 +02:00
Karsten Loesing
1743e912c3 Parse sanitized bridge descriptor version 1.0. 2012-05-31 09:25:42 +02:00
Karsten Loesing
05a1cf7e7d Parse new .tpf Torperf data format. 2012-05-30 10:59:09 +02:00
Karsten Loesing
466725e2ea Parse v1 directories and contained server descriptors. 2012-05-19 19:30:28 +02:00
Karsten Loesing
49a88e7eaa Add @type annotations for sanitized bridge descriptors.
Spotted by Damian.
2012-05-19 11:48:29 +02:00
Karsten Loesing
26083ebde4 Fix unit tests.
- Annotation lines starting with @ are now recognized.
- Unrecognized keywords in "w" lines are now ignored.
2012-05-19 11:42:21 +02:00
Karsten Loesing
316e956bc0 Ignore unknown keywords in "w" lines.
moria1 added a Capped= keyword to debug #2286 which made DocTor and
metrics-db freak out.  The correct behavior is to ignore unknown keywords.
2012-05-19 10:14:54 +02:00
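Ignoring unknown keywords, as 316e956bc0 calls for, can be sketched as a parser that handles the keywords it knows and skips the rest. The class and field names are hypothetical. Bandwidth=, Measured=, Unmeasured=1, and Capped= are the keywords mentioned in this log.

```java
final class WLineSketch {

  long bandwidth = -1L;        // -1 means the keyword was not present
  long measured = -1L;
  boolean unmeasured = false;

  // Parses a "w" line and silently skips keywords it does not recognize.
  static WLineSketch parse(String line) {
    String[] parts = line.split("[ \t]+");
    if (parts.length < 2 || !parts[0].equals("w")) {
      throw new IllegalArgumentException("Not a w line: " + line);
    }
    WLineSketch result = new WLineSketch();
    for (int i = 1; i < parts.length; i++) {
      String part = parts[i];
      if (part.startsWith("Bandwidth=")) {
        result.bandwidth = Long.parseLong(
            part.substring("Bandwidth=".length()));
      } else if (part.startsWith("Measured=")) {
        result.measured = Long.parseLong(
            part.substring("Measured=".length()));
      } else if (part.equals("Unmeasured=1")) {
        result.unmeasured = true;
      }
      // Anything else, for example Capped=1, is ignored on purpose.
    }
    return result;
  }
}
```

A line such as "w Bandwidth=100 Capped=1 Measured=90" parses without error in this sketch, and the Capped= part has no effect on the result.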
Karsten Loesing
01878416dc Correctly handle @type annotations when parsing descriptors. 2012-05-18 17:40:01 +02:00
Karsten Loesing
02fa685e9c Looks like blank lines are allowed in v2 statuses.
For the moment, we still disallow blank lines in all other descriptors.
If this is not correct, we can easily fix that.
2012-05-16 17:35:17 +02:00
Karsten Loesing
0eb47d2650 Add support for parsing v2 network statuses. 2012-05-16 17:01:30 +02:00
Karsten Loesing
20b1ef6378 Fix unit tests. 2012-05-16 16:50:15 +02:00
Karsten Loesing
5d67942706 Use the descriptor parser interface in the downloader, too. 2012-05-09 12:42:55 +02:00