| -rw-r--r-- | proposals/ideas/xxx-exit-statistics.txt | 66 |
1 files changed, 45 insertions, 21 deletions
diff --git a/proposals/ideas/xxx-exit-statistics.txt b/proposals/ideas/xxx-exit-statistics.txt
index 2168e01..86b5176 100644
--- a/proposals/ideas/xxx-exit-statistics.txt
+++ b/proposals/ideas/xxx-exit-statistics.txt
@@ -6,24 +6,32 @@ Status: Draft
 
 1. Motivation
 
-We propose to begin recording aggregate traffic information from exit
-nodes which will be of use to researchers, similar to how the project
-already collects statistics about its user population.
-
-Individual researchers have instrumented exit nodes on an ad-hoc basis
-(e.g. [1]) but the project has historically been skeptical about both
-the value and the safety of doing so (e.g. [2] warns of running afoul
-of wiretapping statutes; I understand that the [1] paper was met with
-severe criticism, but I can't find that right now). A principled,
-network-wide policy and mechanism for collecting exit metrics can
-satisfy the demand for data for research purposes while protecting our
-users and minimizing additional legal risk to exit node operators.
+We propose to collect additional aggregate traffic information from
+exit nodes which will be of use to researchers, similar to how the
+project already collects statistics about its user population. We
+also propose to bring existing entry and exit statistics under the
+same umbrella so that anonymity protection can be applied in a
+principled fashion to the entire data set. There is some prior
+discussion in tickets #6002 and #6003.
+
+Exit nodes currently collect statistics on the destination TCP ports
+of exiting traffic, but this is insufficient information for research;
+in particular, there have been repeated requests for information about
+destination hosts. Individual researchers have instrumented exit
+nodes on an ad-hoc basis (e.g. [1]) but the project has historically
+been skeptical about both the value and the safety of doing so
+(e.g. [2] warns of running afoul of wiretapping statutes; I understand
+that the [1] paper was met with severe criticism, but I can't find
+that right now). A principled, network-wide policy and mechanism for
+collecting exit metrics can satisfy the demand for data for research
+purposes while protecting our users and minimizing additional legal
+risk to exit node operators.
 
 2. Design
 
-The metrics we propose to collect at each exit node are the number of
-exiting TCP connections and total bytes transferred in each direction,
-per day, categorized three different ways:
+At each exit node, we propose to measure the number of exiting TCP
+connections and total bytes transferred in each direction, per day,
+categorized three different ways:
 
  * TCP port
  * "Public suffix" + 1 domain component of destination
@@ -40,13 +48,29 @@ anonymity. At the same time, they will enable interesting analyses of
 what the network is used _for_, in much the same way that the existing
 entry-side metrics enable analysis of _who_ uses the network.
 
-It may be appropriate to rationalize collection of entry-side metrics
-at the same time; I am under the impression that much of what we know
-about the user population is actually derived from directory queries,
-not actual entry nodes. Collecting entry and exit information via the
-same mechanism would facilitate applying "noise" (see below)
+At the same time, we propose to rationalize entry-side data
+collection, which currently relies on directory queries rather than
+actual traffic to actual entry nodes. This will improve accuracy and
+will also allow us to apply an anonymity-protection algorithm
 consistently to the entire data set available from the Metrics server.
 
+Entry nodes (including bridges) should record entering TCP connections
+and traffic volume, per day, categorized by:
+
+ * Country of IP address of traffic source
+ * ASN of IP address of traffic source, and ASN of entry node
+   (subject to same caveat as above)
+ * "I am a bridge" flag
+
+(I think this is a superset of the information currently collected via
+directory queries. If I am mistaken, please let me know.)
+
+All collected data should be passed through a differential-privacy
+sanitization algorithm before it leaves the Tor process's memory, and
+should then be uploaded to a central server (probably via a write-only
+hidden service API) which applies a second layer of sanitization. See
+below for further discussion.
+
 3. Security implications
 
 Any collection of information about the operation of the Tor network
@@ -67,7 +91,7 @@ The theoretical framework that deals with this class of exposure is
 called _differential privacy_. It rests on two observations. First,
 it is impossible to guarantee that _no one's_ privacy will be
 compromised via the release of statistics -- consider the somewhat
-contrived but still revealing case where the adversary already knows
+contrived but still evocative case where the adversary already knows
 that Alice is 6cm shorter than the average citizen of Ruritania;
 publication of the Ruritanian average height reveals Alice's actual
 height. Note that Alice _does not_ have to be included in the average
