Update proposal to bring it more in-line with implementation.

This commit is contained in:
Mike Perry 2009-09-01 20:13:52 -07:00
parent 92c40e9727
commit 0b439985f4

View File

@ -2,7 +2,7 @@ Filename: 151-path-selection-improvements.txt
Title: Improving Tor Path Selection
Author: Fallon Chen, Mike Perry
Created: 5-Jul-2008
Status: Draft
Status: Implemented
Overview
@ -22,51 +22,37 @@ Implementation
Storing Build Times
Circuit build times will be stored in the circular array
'circuit_build_times' consisting of uint16_t elements as milliseconds.
The total size of this array will be based on the number of circuits
Circuit build times are stored in the circular array
'circuit_build_times' consisting of uint32_t elements as milliseconds.
The total size of this array is based on the number of circuits
it takes to converge on a good fit of the long term distribution of
the circuit builds for a fixed link. We do not want this value to be
too large, because it will make it difficult for clients to adapt to
moving between different links.
From our initial observations, this value appears to be on the order
of 1000, but will be configurable in a #define NCIRCUITS_TO_OBSERVE.
The exact value for this #define will be determined by performing
goodness of fit tests using measurments obtained from the shufflebt.py
script from TorFlow.
From our observations, this value appears to be on the order of 1000,
but is configurable in a #define NCIRCUITS_TO_OBSERVE.
Long Term Storage
The long-term storage representation will be implemented by storing a
The long-term storage representation is implemented by storing a
histogram with BUILDTIME_BIN_WIDTH millisecond buckets (default 50) when
writing out the statistics to disk. The format of this histogram on disk
is yet to be finalized, but it will likely be of the format
'CircuitBuildTime <bin> <count>', with the total specified as
'TotalBuildTimes <total>'
writing out the statistics to disk. The format this takes in the
state file is 'CircuitBuildTime <bin-ms> <count>', with the total
specified as 'TotalBuildTimes <total>'
Example:
TotalBuildTimes 100
CircuitBuildTimeBin 1 50
CircuitBuildTimeBin 2 25
CircuitBuildTimeBin 3 13
CircuitBuildTimeBin 0 50
CircuitBuildTimeBin 50 25
CircuitBuildTimeBin 100 13
...
Reading the histogram in will entail multiplying each bin by the
BUILDTIME_BIN_WIDTH and then inserting <count> values into the
circuit_build_times array each with the value of
<bin>*BUILDTIME_BIN_WIDTH. In order to evenly distribute the
values in the circular array, a form of index skipping must
be employed. Values from bin #N with bin count C and total T
will occupy indexes specified by N+((T/C)*k)-1, where k is the
set of integers ranging from 0 to C-1.
For example, this would mean that the values from bin 1 would
occupy indexes 1+(100/50)*k-1, or 0, 2, 4, 6, 8, 10 and so on.
The values for bin 2 would occupy positions 1, 5, 9, 13. Collisions
will be inserted at the first empty position in the array greater
than the selected index (which may requiring looping around the
array back to index 0).
Reading the histogram in will entail inserting <count> values
into the circuit_build_times array each with the value of
<bin-ms> milliseconds. In order to evenly distribute the values
in the circular array, the Fisher-Yates shuffle will be performed
after reading values from the bins.
Learning the CircuitBuildTimeout
@ -77,14 +63,28 @@ Implementation
fitting the data using the estimators at
http://en.wikipedia.org/wiki/Pareto_distribution#Parameter_estimation.
The timeout itself will be calculated by solving the CDF for the
a percentile cutoff BUILDTIME_PERCENT_CUTOFF. This value
represents the percentage of paths the Tor client will accept out of
the total number of paths. We have not yet determined a good
cutoff for this mathematically, but 85% seems a good choice for now.
The timeout itself is calculated by using the Quartile function (the
inverted CDF) to give us the value on the CDF such that
BUILDTIME_PERCENT_CUTOFF (80%) of the mass of the distribution is
below the timeout value.
From http://en.wikipedia.org/wiki/Pareto_distribution#Definition,
the calculation we need is pow(BUILDTIME_PERCENT_CUTOFF/100.0, k)/Xm.
Thus, we expect that the Tor client will accept the fastest 80% of
the total number of paths on the network.
Detecting Changing Network Conditions
We attempt to detect both network connectivty loss and drastic
changes in the timeout characteristics. Network connectivity loss
is detected by recording a timestamp every time Tor either completes
a TLS connection or receives a cell. If this timestamp is more than
90 seconds in the past, circuit timeouts are no longer counted.
If more than MAX_RECENT_TIMEOUT_RATE (80%) of the past
RECENT_CIRCUITS (20) time out, we assume the network connection
has changed, and we discard all buildtimes history and compute
a new timeout by estimating a new Pareto curve using the
position on the Pareto Quartile function for the ratio of
timeouts.
Testing
@ -104,7 +104,7 @@ Implementation
of a new circuit, and the hard cutoff triggers destruction of the
circuit.
Good values for hard and soft cutoffs seem to be 85% and 65%
Good values for hard and soft cutoffs seem to be 80% and 60%
respectively, but we should eventually justify this with observation.
When to Begin Calculation