Add proposals 336 and 337.

This commit is contained in:
Nick Mathewson 2021-10-22 17:36:04 -04:00
parent 52a5e71527
commit ea41a66447
2 changed files with 225 additions and 0 deletions

View File

@ -0,0 +1,87 @@
```
Filename: 336-randomize-guard-retries.md
Title: Randomized schedule for guard retries
Author: Nick Mathewson
Created: 2021-10-22
Status: Open
```
# Introduction
When we notice that a guard isn't working, we don't mark it as retriable
until a certain interval has passed. Currently, these intervals are
fixed, as described in the documentation for `GUARDS_RETRY_SCHED` in
`guard-spec` appendix A.1. Here we propose using a randomized retry
interval instead, based on the same decorrelated-jitter algorithm we use
for directory retries.
The upside of this approach is that it makes our behavior in
the presence of an unreliable network a bit harder for an attacker to
predict. It also means that if a guard goes down for a while, its
clients will notice that it is up at staggered times, rather than
probing it in lock-step.
The downside of this approach is that we can, if we get unlucky
enough, completely fail to notice that a preferred guard is online when
we would otherwise have noticed sooner.
Note that when a guard is marked retriable, it isn't necessarily retried
immediately. Instead, its status is changed from "Unreachable" to
"Unknown", which will cause it to get retried.
For reference, our previous schedule was:
```
{param:PRIMARY_GUARDS_RETRY_SCHED}
-- every 10 minutes for the first six hours,
-- every 90 minutes for the next 90 hours,
-- every 4 hours for the next 3 days,
-- every 9 hours thereafter.
{param:GUARDS_RETRY_SCHED} --
-- every hour for the first six hours,
-- every 4 hours for the next 90 hours,
-- every 18 hours for the next 3 days,
-- every 36 hours thereafter.
```
# The new algorithm
We re-use the decorrelated-jitter algorithm from `dir-spec` section 5.5.
The specific formula used to compute the 'i+1'th delay is:
```
Delay_{i+1} = MIN(cap, random_between(lower_bound, upper_bound))
where upper_bound = MAX(lower_bound+1, Delay_i * 3)
lower_bound = MAX(1, base_delay).
```
For primary guards, we set base_delay to 30 seconds and cap to 6 hours.
For non-primary guards, we set base_delay to 10 minutes and cap to 36
hours.
(These parameters were selected by simulating the results of using them
until they looked "a bit more aggressive" than the current algorithm, but
not too much.)
The average behavior for the new primary schedule is:
```
First 1.0 hours: 10.14283 attempts. (Avg delay 4m 47.41s)
First 6.0 hours: 19.02377 attempts. (Avg delay 15m 36.95s)
First 96.0 hours: 56.11173 attempts. (Avg delay 1h 40m 3.13s)
First 168.0 hours: 83.67091 attempts. (Avg delay 1h 58m 43.16s)
Steady state: 2h 36m 44.63s between attempts.
```
The average behavior for the new non-primary schedule is:
```
First 1.0 hours: 3.08069 attempts. (Avg delay 14m 26.08s)
First 6.0 hours: 8.1473 attempts. (Avg delay 35m 25.27s)
First 96.0 hours: 22.57442 attempts. (Avg delay 3h 49m 32.16s)
First 168.0 hours: 29.02873 attempts. (Avg delay 5h 27m 2.36s)
Steady state: 11h 15m 28.47s between attempts.
```

View File

@ -0,0 +1,138 @@
```
Filename: 337-simpler-guard-usability.md
Title: A simpler way to decide, "Is this guard usable?"
Author: Nick Mathewson
Created: 2021-10-22
Status: Open
```
# Introduction
The current `guard-spec` describes a mechanism for how to behave when
our primary guards are unreachable, and we don't know which other guards
are reachable. This proposal describes a simpler method, currently
implemented in [Arti](https://gitlab.torproject.org/tpo/core/arti/).
(Note that this method might not actually give different results: its
only advantage is that it is much simpler to implement.)
## The task at hand
For illustration, we'll assume that our primary guards are P1, P2, and
P3, and our subsequent guards (in preference order) are G1, G2, G3, and
so on. The status of each guard is Reachable (we think we can connect
to it), Unreachable (we think it's down), or Unknown (we haven't tried
it recently).
The question becomes, "What should we do when P1, P2, and P3 are
Unreachable, and G1, G2, ... are all Unknown"?
In this circumstance, we _could_ say that we only build circuits to G1,
wait for them to succeed or fail, and only try G2 if we see that the
circuits to G1 have failed completely. But that delays in the case that
G1 is down.
Instead, the first time we get a circuit request, we try to build one
circuit to G1. On the next circuit request, if the circuit to G1 isn't
done yet, we launch a circuit to G2 instead. The next request (if the
G1 and G2 circuits are still pending) goes to G3, and so on. But
(here's the critical part!) we don't actually _use_ the circuit to G2
unless the circuit to G1 fails, and we don't actually _use_ the circuit
to G3 unless the circuits to G1 and G2 both fail.
This approach causes Tor clients to check the status of multiple
possible guards in parallel, while not actually _using_ any guard until
we're sure that all the guards we'd rather use are down.
## The current algorithm and its drawbacks
For the current algorithm, see `guard-spec` section 4.9: circuits are
exploratory if they are not using a primary guard. If such an
exploratory circuit is `waiting_for_better_guard`, then we advance it
(or not) depending on the status of all other _circuits_ using guards that
we'd rather be using.
In other words, the current algorithm is described in terms of actions
to take with given circuits.
For Arti (and for other modular Tor implementations), however, this
algorithm is a bit of a pain: it introduces dependencies between the
guard code and the circuit handling code, requiring each one to mess
with the other.
# Proposal
I suggest that we describe an alternative algorithm for handing circuits
to non-primary guards, to be used in preference to the current
algorithm. Unlike the existing approach, it isolates the guard logic a
bit better from the circuit logic.
## Handling exploratory circuits
When all primary guards are Unreachable, we need to try non-primary
guards. We select the first such guard (in preference order) that is
neither Unreachable nor Pending. Whenever we give out such a guard, if
the guard's status is Unknown, then we call that guard "Pending" until
the attempt to use it succeeds or fails. We remember when the guard
became Pending.
> Aside: None of the above is a change from our existing specification.
After completing a circuit, the implementation must check whether
its guard is usable. A guard is usable according to these rules:
Primary guards are always usable.
Non-primary guards are usable for a given circuit if every guard earlier
in the preference list is either unsuitable for that circuit
(e.g. because of family restrictions), or marked as Unreachable, or has
been pending for at least `{NONPRIMARY_GUARD_CONNECT_TIMEOUT}`.
Non-primary guards are unusable for a given circuit if some guard earlier
in the preference list is suitable for the circuit _and_ Reachable.
Non-primary guards are unusable if they have not become usable after
`{NONPRIMARY_GUARD_IDLE_TIMEOUT}` seconds.
If a circuit's guard is neither usable nor unusable immediately, the
circuit is not discarded; instead, it is kept (but not used) until it
becomes usable or unusable.
> I am not 100% sure whether this description produces the same behavior
> as the current guard-spec, but it is simpler to describe, and has
> proven to be simpler to implement.
## Implications for program design.
(This entire section is implementation detail to explain why this is a
simplification from the previous algorithm. It is for explanatory
purposes only and is not part of the spec.)
With this algorithm, we cut down the interaction between the guard code
and the circuit code considerably, but we do not remove it entirely.
Instead, there remains (in Arti terms) a pair of communication channels
between the circuit manager and the guard manager:
* Whenever a guard is given to the circuit manager, the circuit manager
receives the write end of a single-use channel to
report whether the guard has succeeded or failed.
* Whenever a non-primary guard is given to the circuit manager, the
circuit receives the read end of a single-use channel that will tell
it whether the guard is usable or unusable. This channel doesn't
report anything until the guard has one status or the other.
With this design, the circuit manager never needs to look at the list of
guards, and the guard manager never needs to look at the list of
circuits.
## Subtleties concerning "guard success"
Note that the above definitions of a Reachable guard depend on reporting
when the _guard_ is successful or failed. This is not necessarily the
same as reporting whether the _circuit_ is successful or failed. For
example, a circuit that fails after the first hop does not necessarily
indicate that there's anything wrong with the guard. Similarly, we can
reasonably conclude that the guard is working (at least somewhat) as
long as we have an open channel to it.