Wednesday, November 18, 2009

Indexing and Anonymity

This is a follow-up to my post on my ideal file sharing network, which was originally posted as a comment on Reddit. It has a bit of overlap with that previous post, but I cover some additional technical details (hopefully in an understandable way) to explain why these problems are hard.

------------

We need to completely remove any remaining weak points in file sharing. By my count, there are only two more problems to solve to make BitTorrent "perfect":

1) Decentralized indexing/searching.

This is hard because the distributed hash tables (DHTs) that underlie BitTorrent (not the "Mainline DHT" tracking system specifically, but the actual hash algorithm used to map a key to a node) cannot easily do inexact string matching. Like all hash algorithms, a small change in the input string makes a big change in the hash key, so two nearly identical search terms map to completely unrelated nodes (see the short sketch below). This prevents a naive decentralized Kazaa/Limewire-style search (those networks use a much more primitive base protocol that allows a search to simply "flood" the network N levels deep).
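To make that concrete, here is a minimal Python sketch (the search terms are made up) of why keyword lookups break down over a DHT: the key for a query is a cryptographic hash, so even a one-character difference sends the lookup to an entirely different part of the keyspace.

```python
import hashlib

def dht_key(term: str) -> str:
    """Map a search term to a 160-bit key, Kademlia/Mainline-style."""
    return hashlib.sha1(term.encode("utf-8")).hexdigest()

# Two nearly identical queries land on completely unrelated nodes:
print(dht_key("ubuntu 9.10 iso"))
print(dht_key("ubuntu 9.10 isos"))
# The two keys share no structure, so the node storing one entry has
# no idea the other exists; exact matches work, fuzzy search does not.
```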

There is some research being done on this problem (e.g. http://www.computer.org/portal/web/csdl/doi/10.1109/ICDCS.2005.44). Overall, the outlook is promising, and if the BitTorrent indexing services lose future lawsuits (which they probably will), this technology will be deployed.

2) Anonymity.

This is the "golden goose" of BitTorrent. Right now, the IP addresses of file sharers are directly exposed. This leaves them vulnerable to MAFIAA lawsuits and therefore fewer people are willing to seed (and many people are scared away altogether). Fewer seeders means slower torrents. If users could be certain that there is no possibility that they can be traced to a particular upload of copyrighted material, then the number of seeders in the network should rise dramatically as users just leave their torrents on without any fears.

The best way to do anonymity in a network right now is onion routing (e.g. Tor). Unfortunately, Tor is very slow, and that slowness is itself mainly due to legal fears and poor incentives.

A brief overview of Tor for the uninitiated... Tor works by wrapping a message in multiple encrypted layers (the "onion") and forwarding it along a route of several nodes. Each node in the route decrypts the outer layer (peeling away one "onion" layer) and forwards the result to the next node. The last node (a.k.a. the exit node) then handles the request and sends the result back using a "reply onion" defined by the original requester. No single node in the route can connect the requester to the server that eventually receives the request, which makes the requester anonymous. As long as the nodes in the onion route are not all cooperating (hard to arrange in practice, because the requester can select any set of nodes it wants), this is extremely secure. The weakness of the system is that it provides no incentive for peers to act as internal routers, and especially not as exit nodes (which is dangerous from a legal perspective because much of the traffic leaving Tor is illegal). More on Tor available here.
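Here is a minimal sketch of the layering idea in Python, using the `cryptography` package's Fernet recipe as a stand-in for Tor's actual circuit cryptography (the route, node names, and shared keys are all hypothetical; real Tor negotiates a key with each relay via public-key handshakes):

```python
from cryptography.fernet import Fernet

# Each relay on the chosen route has its own symmetric key.
route = ["entry", "middle", "exit"]
keys = {node: Fernet(Fernet.generate_key()) for node in route}

# The sender wraps the payload once per hop, innermost layer first,
# so the entry node's layer ends up outermost.
onion = b"GET /some-file"
for node in reversed(route):
    onion = keys[node].encrypt(onion)

# Each relay peels exactly one layer and forwards the remainder; only
# the exit node ever sees the plaintext request, and only the entry
# node ever sees the requester's address.
for node in route:
    onion = keys[node].decrypt(onion)
print(onion)  # b'GET /some-file'
```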

Tor's performance is weak on both of the classic networking metrics: latency and throughput. The high latency arises from the fact that each request has to make multiple hops across the Internet before reaching its destination. Nothing can be done about this, so any application that requires low latency is unsuitable for onion routing. On the throughput side, Tor does decrease throughput somewhat (because nodes send and receive more data than they otherwise would), but given that connections between nodes rarely operate at their maximum transfer rate (and this will improve further as fiber to the home is deployed), the throughput of a Tor-like network should be high enough to support BitTorrent-style downloads (keep in mind that latency does not matter very much for BitTorrent). The key issue is to ensure that the onion routing network is not maxed out by free riders who consume onion-router bandwidth without contributing any.
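A back-of-envelope calculation (all numbers below are illustrative, not measurements) shows why the two metrics diverge: every hop adds its full delay to the latency, but sustained throughput is capped only by the slowest link on the circuit.

```python
# Hypothetical 3-hop circuit: per-hop delays and spare relay capacity.
hop_delays_ms = [40, 60, 50]
link_caps_mbps = [8.0, 2.0, 5.0]

circuit_latency = sum(hop_delays_ms)       # 150 ms vs ~40 ms direct
circuit_throughput = min(link_caps_mbps)   # 2 Mbps: the bottleneck relay

print(f"latency: {circuit_latency} ms, throughput: {circuit_throughput} Mbps")
# Painful for interactive browsing; perfectly adequate for a bulk
# BitTorrent download, which only cares about sustained transfer rate.
```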

There are attempts to address this problem. The BitBlinder system tries to create incentives, using an electronic cash infrastructure that requires nodes to act as onion routers. This should increase the internal bandwidth available within the network and vastly improve performance. They claim to have achieved reasonable download speeds on BitTorrent over their network (which is still being developed and is currently available as a limited release). More info here.
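As a rough illustration of the incentive mechanism (my own simplification; BitBlinder's actual e-cash design is more involved than this and is not shown here), the idea is that relaying traffic earns credits and having your own traffic relayed spends them, so free riders get no service:

```python
# Toy credit ledger: relays earn credits for forwarding traffic and
# spend credits to have their own traffic forwarded.
class CreditLedger:
    def __init__(self):
        self.balances = {}

    def earn(self, node: str, megabytes: float):
        """Credit a node for relaying traffic on others' behalf."""
        self.balances[node] = self.balances.get(node, 0.0) + megabytes

    def spend(self, node: str, megabytes: float) -> bool:
        """Debit a node that wants its own traffic relayed."""
        if self.balances.get(node, 0.0) < megabytes:
            return False  # free rider: no credits, no service
        self.balances[node] -= megabytes
        return True

ledger = CreditLedger()
ledger.earn("alice", 100.0)          # alice relayed 100 MB for others
print(ledger.spend("alice", 40.0))   # True: she can download 40 MB
print(ledger.spend("bob", 10.0))     # False: bob never relayed anything
```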

Another approach is the friend-to-friend darknet design: each node connects only to nodes it trusts, so untrusted parties are never in a position to identify the node as the source of an upload. This approach is implemented in the OneSwarm project from the University of Washington.

The key difference between the BitBlinder and OneSwarm approaches is trust. In BitBlinder, no one has to be trusted, but that requires the additional overhead of full onion routing. In OneSwarm, the overhead is that one has to traverse the trust network to reach data, but given the small-world phenomenon, this should require no more than six degrees to find any Kevin Bacon movie. ;)
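To illustrate the traversal, here is a toy Python sketch of locating a source over a friend-to-friend overlay (the topology and the breadth-first query are my own simplification; OneSwarm's actual locating/routing protocol may differ, which is exactly what I ask about below):

```python
# Queries travel only along trusted edges, so finding a source is a
# breadth-first search of the social graph (hypothetical topology).
from collections import deque

trust = {
    "me":     ["alice", "bob"],
    "alice":  ["me", "carol"],
    "bob":    ["me", "dave"],
    "carol":  ["alice", "seeder"],
    "dave":   ["bob"],
    "seeder": ["carol"],
}

def find_path(start, target):
    """Return the chain of trusted friends from start to target."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for friend in trust[path[-1]]:
            if friend not in seen:
                seen.add(friend)
                queue.append(path + [friend])
    return None

print(find_path("me", "seeder"))  # ['me', 'alice', 'carol', 'seeder']
```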

Note: I am not completely familiar with the OneSwarm architectural details (the papers are buried in my overly long reading list), so if someone who knows it (preferably someone who helped develop it) can fill in the details with a reply, that would be great. I'm particularly interested in how locating and routing are performed. Do you always talk only to trusted nodes, or can there ever be short circuits in the trust network? Also, can the trusted nodes observe the traffic you are sending/receiving unencrypted?

There is a third feature that may or may not turn out to be necessary: traffic obfuscation. If the legal system decides that forwarding encrypted traffic is bad and orders ISPs to use traffic analysis to block it, then we will need to disguise the P2P traffic in some way. The simplest way I can think of is to set up a bunch of fake VoIP/videoconferencing calls among the friend-to-friend network and pad the stream with fake data whenever no P2P traffic is being sent. The stream can also be shaped to show the same traffic-analysis patterns as normal VoIP/videoconferencing traffic. Other protocols could probably serve as cover too, but I picked VoIP/videoconferencing for this example because they send lots of data.
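A toy sketch of the padding idea (the frame size and the length-prefix framing are assumptions for illustration, loosely modeled on fixed-rate voice codecs): every outgoing frame has the same size whether or not real data is available, so a traffic analyst sees a constant-rate stream either way.

```python
import os

FRAME_BYTES = 320  # fixed frame size, roughly a 20 ms voice frame

def next_frame(pending: bytes) -> tuple[bytes, bytes]:
    """Emit a constant-size frame: real payload first, noise padding after.

    A 2-byte length prefix (carried inside the encrypted frame in
    practice) tells the receiver how much of the frame is real.
    """
    chunk, rest = pending[: FRAME_BYTES - 2], pending[FRAME_BYTES - 2:]
    header = len(chunk).to_bytes(2, "big")
    padding = os.urandom(FRAME_BYTES - 2 - len(chunk))
    return header + chunk + padding, rest

frame, remaining = next_frame(b"piece 42 of some torrent")
print(len(frame))  # always 320, with or without real payload
```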

Conclusion: the future of P2P is very bright. Short of extremely draconian legal measures that violate net neutrality and privacy (e.g. throwing someone in jail for forwarding encrypted traffic on behalf of others, with no knowledge of what the traffic is), there is no way to take out the P2P networks of the near future. The pattern is clear: every time the MAFIAA makes progress in its lawsuits, the next level of the technology is released. We are only a couple of steps away from a network in which the MAFIAA has no one to sue except their immediate neighbors in the network, who can claim with plausible deniability that they did not knowingly forward any copyrighted material.
