-rw-r--r-- proposals/ideas/xxx-exit-statistics.txt | 66
1 files changed, 45 insertions, 21 deletions
diff --git a/proposals/ideas/xxx-exit-statistics.txt b/proposals/ideas/xxx-exit-statistics.txt
index 2168e01..86b5176 100644
--- a/proposals/ideas/xxx-exit-statistics.txt
+++ b/proposals/ideas/xxx-exit-statistics.txt
@@ -6,24 +6,32 @@ Status: Draft
1. Motivation
-We propose to begin recording aggregate traffic information from exit
-nodes which will be of use to researchers, similar to how the project
-already collects statistics about its user population.
-
-Individual researchers have instrumented exit nodes on an ad-hoc basis
-(e.g. [1]) but the project has historically been skeptical about both
-the value and the safety of doing so (e.g. [2] warns of running afoul
-of wiretapping statutes; I understand that the [1] paper was met with
-severe criticism, but I can't find that right now). A principled,
-network-wide policy and mechanism for collecting exit metrics can
-satisfy the demand for data for research purposes while protecting our
-users and minimizing additional legal risk to exit node operators.
+We propose to collect additional aggregate traffic information from
+exit nodes which will be of use to researchers, similar to how the
+project already collects statistics about its user population. We
+also propose to bring existing entry and exit statistics under the
+same umbrella so that anonymity protection can be applied in a
+principled fashion to the entire data set. There is some prior
+discussion in tickets #6002 and #6003.
+
+Exit nodes currently collect statistics on the destination TCP ports
+of exiting traffic, but this is insufficient information for research;
+in particular, there have been repeated requests for information about
+destination hosts. Individual researchers have instrumented exit
+nodes on an ad-hoc basis (e.g. [1]) but the project has historically
+been skeptical about both the value and the safety of doing so
+(e.g. [2] warns of running afoul of wiretapping statutes; I understand
+that the [1] paper was met with severe criticism, but I can't find
+that right now). A principled, network-wide policy and mechanism for
+collecting exit metrics can satisfy the demand for data for research
+purposes while protecting our users and minimizing additional legal
+risk to exit node operators.
2. Design
-The metrics we propose to collect at each exit node are the number of
-exiting TCP connections and total bytes transferred in each direction,
-per day, categorized three different ways:
+At each exit node, we propose to measure the number of exiting TCP
+connections and total bytes transferred in each direction, per day,
+categorized three different ways:
* TCP port
* "Public suffix" + 1 domain component of destination
@@ -40,13 +48,29 @@ anonymity. At the same time, they will enable interesting analyses of
what the network is used _for_, in much the same way that the existing
entry-side metrics enable analysis of _who_ uses the network.
-It may be appropriate to rationalize collection of entry-side metrics
-at the same time; I am under the impression that much of what we know
-about the user population is actually derived from directory queries,
-not actual entry nodes. Collecting entry and exit information via the
-same mechanism would facilitate applying "noise" (see below)
+At the same time, we propose to rationalize entry-side data
+collection, which currently relies on directory queries rather than
+actual traffic to actual entry nodes. This will improve accuracy and
+will also allow us to apply an anonymity-protection algorithm
consistently to the entire data set available from the Metrics server.
+Entry nodes (including bridges) should record entering TCP connections
+and traffic volume, per day, categorized by:
+
+ * Country of IP address of traffic source
+ * ASN of IP address of traffic source, and ASN of entry node
+ (subject to same caveat as above)
+ * "I am a bridge" flag
+
+(I think this is a superset of the information currently collected via
+directory queries. If I am mistaken, please let me know.)
+
+All collected data should be passed through a differential-privacy
+sanitization algorithm before it leaves the Tor process's memory, and
+should then be uploaded to a central server (probably via a write-only
+hidden service API) which applies a second layer of sanitization. See
+below for further discussion.
+
3. Security implications
Any collection of information about the operation of the Tor network
@@ -67,7 +91,7 @@ The theoretical framework that deals with this class of exposure is
called _differential privacy_. It rests on two observations. First,
it is impossible to guarantee that _no one's_ privacy will be
compromised via the release of statistics -- consider the somewhat
-contrived but still revealing case where the adversary already knows
+contrived but still evocative case where the adversary already knows
that Alice is 6cm shorter than the average citizen of Ruritania;
publication of the Ruritanian average height reveals Alice's actual
height. Note that Alice _does not_ have to be included in the average