Review of Ratnasamy et al.
Kiran Komaravolu

Ratnasamy et al.'s article describes a scalable content-addressable network. A content-addressable network (CAN) may be described as a distributed, hash-table-like structure at Internet scale. The aim of a CAN is to provide lookup services for a large database, such as the content index of a peer-to-peer network (p2p networks appear to be the prime target of CANs, although the idea might be used in other areas too). The lookup service in existing peer-to-peer networks either scales poorly (for example, by flooding queries) or depends on a centralized lookup server.

The basic design of a CAN is a hash table of key-value pairs spread over a virtual coordinate space. A d-dimensional coordinate space is divided among the nodes, and each node controls a part of the space called its zone. A key is hashed to a point in the space, and the node in charge of the zone that contains that point stores the key-value pair. Each node knows its logical neighbors (the nodes whose zones adjoin its own) and uses this information to route requests toward the destination point. Each CAN message holds the source and destination addresses.
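The mapping from key to owning node can be sketched in a few lines of Python. This is only an illustration under assumptions that do not come from the article: the space is the unit square, SHA-256 stands in for the hash function, the zones are four fixed quadrants, and the names Zone, key_to_point and owner_of are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Axis-aligned box: lo is the inclusive corner, hi the exclusive corner."""
    lo: tuple
    hi: tuple

    def contains(self, point):
        return all(l <= p < h for l, p, h in zip(self.lo, point, self.hi))


def key_to_point(key, dims=2):
    """Hash a key to a point in [0, 1)^dims (assumed hash: SHA-256)."""
    if not 1 <= dims <= 8:
        raise ValueError("this sketch supports 1 to 8 dimensions")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use 4 bytes of the digest per coordinate and scale into [0, 1).
    return tuple(
        int.from_bytes(digest[4 * i:4 * i + 4], "big") / 2**32
        for i in range(dims)
    )


def owner_of(zones, point):
    """Return the node whose zone contains the point."""
    for node, zone in zones.items():
        if zone.contains(point):
            return node
    raise LookupError("no zone contains the point")


# Hypothetical CAN with four nodes, one per quadrant of the unit square.
zones = {
    "A": Zone((0.0, 0.0), (0.5, 0.5)),
    "B": Zone((0.5, 0.0), (1.0, 0.5)),
    "C": Zone((0.0, 0.5), (0.5, 1.0)),
    "D": Zone((0.5, 0.5), (1.0, 1.0)),
}
point = key_to_point("song.mp3")
print(point, "is stored at node", owner_of(zones, point))
```

The same key always hashes to the same point, so any node can work out where a key-value pair should live without consulting a central index.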

Any new node wishing to join a CAN must first locate a node that already belongs to the CAN. The new node then chooses a random point in the space and sends a join request containing this point to that CAN node. The CAN node routes the request to the node that owns the zone containing the point. That zone is split in two, and the part containing the point becomes a new zone assigned to the new node.
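The split step of the join can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: it assumes zones are axis-aligned boxes, always halves the zone along a caller-chosen axis, and skips the routing of the join request. The names Zone, split_zone and join are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Axis-aligned box: lo is the inclusive corner, hi the exclusive corner."""
    lo: tuple
    hi: tuple

    def contains(self, point):
        return all(l <= p < h for l, p, h in zip(self.lo, point, self.hi))


def split_zone(zone, axis):
    """Halve a zone along one axis and return (lower_half, upper_half)."""
    mid = (zone.lo[axis] + zone.hi[axis]) / 2
    lower = Zone(zone.lo, zone.hi[:axis] + (mid,) + zone.hi[axis + 1:])
    upper = Zone(zone.lo[:axis] + (mid,) + zone.lo[axis + 1:], zone.hi)
    return lower, upper


def join(zones, new_node, point, axis=0):
    """Give new_node the half of the occupied zone that contains point."""
    for occupant, zone in list(zones.items()):
        if zone.contains(point):
            lower, upper = split_zone(zone, axis)
            if lower.contains(point):
                zones[new_node], zones[occupant] = lower, upper
            else:
                zones[new_node], zones[occupant] = upper, lower
            return zones[new_node]
    raise LookupError("no zone contains the point")


# Node A owns the whole unit square; node B joins at a chosen point.
zones = {"A": Zone((0.0, 0.0), (1.0, 1.0))}
join(zones, "B", (0.75, 0.25))
print(zones)
```

After the split, the key-value pairs whose points fall in the new zone would also have to be handed over to the new node; that transfer is left out of the sketch.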

To improve performance, the CAN topology and addresses could be mapped to the underlying IP paths. Increasing the dimension of the space also shortens routing paths and improves routing fault tolerance. Another modification is the introduction of realities, in which multiple independent coordinate spaces are maintained; each coordinate space is called a reality. A node in an r-reality CAN belongs to r coordinate spaces. The hash table contents are replicated in each reality, thereby improving availability.
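The effect of dimension on path length can be made concrete. For a d-dimensional space divided into n equal zones, the CAN paper gives an average routing path of (d/4) * n^(1/d) hops, with each node keeping 2d neighbors. The short Python sketch below only evaluates those two expressions; the function names and the network size of 2^20 nodes are illustrative choices.

```python
def average_path_hops(n_nodes, dims):
    """Average routing path length for n equal zones in dims dimensions."""
    return (dims / 4) * n_nodes ** (1 / dims)


def neighbors_per_node(dims):
    """Number of neighbors each node tracks in a dims-dimensional CAN."""
    return 2 * dims


if __name__ == "__main__":
    n = 2**20  # illustrative network size
    for d in (2, 3, 4, 8):
        hops = average_path_hops(n, d)
        print(f"d={d}: {neighbors_per_node(d)} neighbors, about {hops:.1f} hops")
```

The trade-off is that shorter paths are bought with more per-node state, since the neighbor count grows linearly with the dimension.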

The article also suggests overloading coordinate zones, so that multiple nodes share the same zone. This prevents a heavily used zone from becoming a single point of failure. Caching and replication of content are also suggested to prevent hot spots and improve performance.

The article, though, misses a few key points. The problem of initializing the network still exists: a new node needs some centralized server to locate an existing CAN node before it can join. There is no description of the data keys that will be stored in the nodes, nor of how the keys will actually be inserted. In a huge network it is possible (and probably inevitable) that a key will map to a very large value. A popular key might return the IP addresses of every node holding a particular file. It might not be a good idea for a single node to hold so much data, and some nodes might not want to. Thus, although the authors present a conceptually good idea, there are still some holes that need to be filled. The paper is nonetheless wholeheartedly recommended reading.

Updates to the system are also very hard to make. Suppose a new node joins the system: its content needs to be advertised to the rest of the network. The hash function is run on each file, and the result for each file is sent to a possibly different node. This might be a bit complex, especially when the node stays in the system for only a short period.
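The cost of advertising content can be illustrated with a small Python sketch. It rests on assumptions that are not in the article: file names are used as keys, SHA-256 is the hash function, the space is a unit square cut into a uniform 4 x 4 grid of zones identified by their cell indices, and the address "10.0.0.7" is made up. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def key_to_point(key, dims=2):
    """Hash a key to a point in [0, 1)^dims (assumed hash: SHA-256)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return tuple(
        int.from_bytes(digest[4 * i:4 * i + 4], "big") / 2**32
        for i in range(dims)
    )


def grid_owner(point, cells_per_side):
    """Identify the owning zone by its cell index in a uniform grid."""
    return tuple(min(int(c * cells_per_side), cells_per_side - 1) for c in point)


def plan_advertisements(filenames, node_address, cells_per_side=4):
    """Group (file, holder address) entries by the zone that must store them."""
    messages = defaultdict(list)
    for name in filenames:
        owner = grid_owner(key_to_point(name), cells_per_side)
        messages[owner].append((name, node_address))
    return dict(messages)


files = ["a.mp3", "b.mp3", "c.txt", "d.iso"]
plan = plan_advertisements(files, "10.0.0.7")
print(len(files), "files require messages to", len(plan), "zone owner(s)")
```

Every distinct zone in the plan is a separate routed message, and each entry has to be withdrawn again when the node leaves, which is why short-lived nodes make this expensive.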