Onionoo design document
=======================

This short document describes Onionoo's design in a mostly informal and language-independent way. The goal is to be able to discuss design decisions with non-Java programmers and to provide a blueprint for porting Onionoo to other programming languages. This document cannot describe all the details, but it can provide a rough overview.

There are two main building blocks of Onionoo that are described here: 1) an hourly cronjob processing newly published Tor descriptors and 2) a web service component answering client requests. The interface between the two building blocks is a directory in the local file system that can be read and written by component 1 and can be read by component 2. In theory, the two components can be implemented in two entirely different programming languages. In a possible port from Java to another programming language, the two components can easily be ported one after the other.

The purpose of the hourly batch processor is to read updated Tor descriptors from the metrics service and transform them to be read by the web service component. Answering a client request in component 2 of Onionoo needs to be highly efficient, which is why any data aggregation needs to happen beforehand. Parsing descriptors on the fly is not an option.

The hourly batch processor runs as a cron job at :15 every hour, usually takes up to five minutes, and consists of the following substeps:

1.1) Rsync new Tor descriptors from metrics.

1.2) Read previously stored status data about relays and bridges that have been running in the last seven days into memory. These data include for each relay or bridge: nickname, fingerprint, primary OR address and port, additional OR addresses and ports, exit addresses, network status publication time, directory port, relay flags, consensus weight, country code, host name as obtained by reverse domain name lookup, and timestamp of the last reverse domain name lookup.

1.3) Import any new relay network status consensuses that have been published since the last run.

1.4) Set the running bit for all relays that are contained in the last known relay network status consensus.

1.5) Look up all relay IP addresses in a local GeoIP database and in a local AS number database. Extract country codes and names, city names, geo coordinates, AS name and number, etc.

1.6) Import any new bridge network statuses that have been published since the last run.

1.7) Start reverse domain name lookups for all relay IP addresses. Run them in the background, only refresh lookups for previously looked-up IP addresses every 12 hours, run up to five lookups in parallel, and set timeouts both for single requests and for the overall lookup process. In theory, this step could happen a few steps earlier, but not before step 1.3.

1.8) Import any new relay server descriptors that have been published since the last run.

1.9) Import any new exit lists that have been published since the last run.

1.10) Import any new bridge server descriptors that have been published since the last run.

1.11) Import any new bridge pool assignments that have been published since the last run.

1.12) Make sure that reverse domain name lookups are finished or that the timeout for running lookups has expired. This step cannot happen any later than step 1.13 and shouldn't happen long before it.

1.13) Rewrite all details files that have changed. Details files combine information from all previously imported descriptor types, database lookups, and performed reverse domain name lookups (see the sketch following this step).
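To make the file layout concrete, here is a minimal, hypothetical sketch of how a details file could be assembled and written under the relay's or bridge's fingerprint. The class and field names, the out/details/ directory, and the use of the Gson library are illustrative assumptions, not a description of the actual implementation.

    import com.google.gson.Gson;

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    /* Hypothetical container for fields that go into a details file;
     * field names here only mirror possible JSON keys. */
    class DetailsDocument {
      String nickname;
      String fingerprint;
      List<String> or_addresses;
      String country;           /* from the GeoIP lookup in step 1.5 */
      String as_number;         /* from the AS number lookup in step 1.5 */
      String host_name;         /* from the reverse DNS lookup in step 1.7 */
      Long consensus_weight;
      Boolean running;
    }

    class DetailsFileWriter {

      /* Serialize the document to JSON and store it in a file named after
       * the fingerprint, so that the web service component can later read
       * it back with a single file system lookup. */
      void writeDetailsFile(DetailsDocument document) throws IOException {
        Gson gson = new Gson();
        String json = gson.toJson(document);
        Path detailsFile = Paths.get("out", "details", document.fingerprint);
        Files.createDirectories(detailsFile.getParent());
        Files.write(detailsFile, json.getBytes(StandardCharsets.UTF_8));
      }
    }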
The web service component needs to be able to retrieve a details file for a given relay or bridge without grabbing information from different data sources. It's best to write the details for a given relay or bridge to a single file in the target JSON format, saved under the relay's or bridge's fingerprint. If a database is used, the raw JSON string should be saved for faster processing.

1.14) Import relays' and bridges' bandwidth histories from extra-info descriptors that have been published since the last run. There must be internally stored bandwidth histories for each relay and bridge, regardless of whether they have been running in the last seven days. The original bandwidth histories, which are available at 15-minute detail, can be aggregated to longer time periods the farther the interval lies in the past. The internal bandwidth histories are different from the bandwidth files described in step 1.15, which are written to be given out to clients.

1.15) Rewrite bandwidth files that have changed. Bandwidth files aggregate bandwidth history information at varying levels of detail, depending on how far observations lie in the past. Writing JSON-formatted bandwidth files for all relays and bridges in the hourly cronjob is unavoidable: any attempt to process years of bandwidth data while answering a web request can only fail. The previously aggregated bandwidth files are stored under the relay's or bridge's fingerprint for quick lookup.

1.16) Update the summary file listing all relays and bridges that have been running in the last seven days, which was previously read in step 1.2. This is the last step in the hourly process. The web service component checks the modification time of this file to decide whether it needs to reload its view of the network. If this step were not the last step, the web service component might list relays or bridges for which no details or bandwidth files are available yet. (With the approach taken here, it's conceivable that a bandwidth file of a relay or bridge that hasn't been running for a week is deleted before step 1.16. This case has been deemed acceptable, because it's highly unlikely. If a database had been used, steps 1.2 to 1.16 would have happened in a single database transaction.)

The web service component has the purpose of answering client requests. It uses the data previously prepared by the hourly cronjob to respond to requests very quickly. During initialization, or whenever the hourly cronjob has finished, the web service component performs the following substeps:

2.1) Read the summary file that was produced by the hourly cronjob in step 1.16.

2.2) Keep the list of relays and bridges in memory, including all information that is used for filtering or sorting results.

2.3) Prepare summary lines for all relays and bridges. The summary resource is a JSON file with a single line per relay or bridge. This line contains only the very few fields, as compared to details files, that a client might use for further filtering of results.

When responding to a request, the web service component performs the following steps (sketched in code after this list):

2.4) Parse the request and its parameters.

2.5) Possibly filter relays and bridges.

2.6) Possibly re-order and limit results.

2.7) Write the response or an error code.

Again (and this can hardly be overstated!), steps 2.4 to 2.7 need to happen *extremely* fast. Any steps that go beyond file system reads or simple database lookups need to happen either in the hourly cronjob (1.1 to 1.16) or in the web service component initialization (2.1 to 2.3).
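The following is a minimal, hypothetical sketch of that request-handling path, operating only on summary lines that are already held in memory. The servlet and field names and the "running" and "limit" parameters are illustrative assumptions and are not meant to define the actual request interface.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    /* Hypothetical summary line kept in memory by the web service
     * component (prepared in steps 2.1 to 2.3). */
    class SummaryLine {
      boolean running;
      long consensusWeight;
      String json;            /* pre-formatted JSON line for this relay/bridge */
    }

    public class SummaryServlet extends HttpServlet {

      /* Filled during initialization from the summary file (step 2.1). */
      private List<SummaryLine> knownRelaysAndBridges = new ArrayList<>();

      @Override
      protected void doGet(HttpServletRequest request,
          HttpServletResponse response) throws IOException {

        /* 2.4) Parse the request and its parameters
         * (error handling omitted in this sketch). */
        String runningParameter = request.getParameter("running");
        String limitParameter = request.getParameter("limit");

        /* 2.5) Possibly filter relays and bridges. */
        List<SummaryLine> results = new ArrayList<>();
        for (SummaryLine line : this.knownRelaysAndBridges) {
          if ("true".equals(runningParameter) && !line.running) {
            continue;
          }
          results.add(line);
        }

        /* 2.6) Possibly re-order and limit results. */
        results.sort(Comparator.comparingLong(
            (SummaryLine line) -> line.consensusWeight).reversed());
        if (limitParameter != null) {
          int limit = Integer.parseInt(limitParameter);
          if (limit < results.size()) {
            results = results.subList(0, limit);
          }
        }

        /* 2.7) Write the response; no descriptor parsing or aggregation
         * happens here, only concatenation of pre-formatted lines. */
        response.setContentType("application/json");
        for (SummaryLine line : results) {
          response.getWriter().println(line.json);
        }
      }
    }

Because filtering, ordering, and limiting only touch data prepared in steps 2.1 to 2.3, the per-request work stays proportional to the number of known relays and bridges and never involves descriptor parsing or bandwidth aggregation.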