Scott Weber
CSE 498

Review of "Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility"

This paper describes and evaluates a prototype of PAST, a peer-to-peer file storage system. PAST is an application built on top of the Pastry location and routing facilities. The objective of the PAST system is to provide a decentralized, peer-to-peer store for files that uses redundancy and the implied geographic diversity of its hosts to provide reliability and persistence.

Pastry is responsible for providing the ability to quickly locate and easily retrieve the data stored in the network. It uses a system similar to that of the Tapestry and Plaxton networks for routing of data and messages. While PAST has been targeted to run on top of Pastry, the authors claim that it should be capable of running on another routing scheme.

PAST relies on smartcards to provide security and authentication. Each node in the system and each writing user must have a smartcard with an identifying public/private key pair. The authors do a good job of making explicit some of their assumptions about security, including that breaking the public key cryptography system used is an intractable problem. Ignoring the error inherent in assuming that any cryptography system is unbreakable, the authors seem to miss another key security assumption: that smartcards are reasonably secure. The problem that the authors do not discuss is that smartcards, being physical objects, are easily stolen, which defeats whatever encryption or authentication algorithms are in place with relatively little effort. The choice of a smartcard as an identifier in the system is not obviously the right one, and the authors do not attempt to defend their decision. Perhaps the reason for using smartcards is described in one of the other papers on security in PAST.

The amount of storage offered up by a given node in PAST is kept fairly homogeneous by an algorithm that rejects stores that are too big or too small. A prospective node's store size is compared with the average of the network; the node is asked to split itself if its store is too big, or is rejected outright if its store is too small. While this most likely simplifies some algorithms for node selection and helps to balance the load evenly across the system, it is not entirely clear how this works. For example, how is a reasonable average size for a node store on the network determined? Can the system adapt if it sees that a lot of nodes are being rejected because their stores are too small, perhaps by requesting all nodes on the network to split themselves? Is there a threshold that is considered too big or too small for the store size?
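
To make these questions concrete, the admission check as I understand it can be sketched roughly as follows. This is only an illustration of the idea, not the authors' algorithm: the function name `admit`, the sample of known capacities, and the `max_ratio` and `min_ratio` thresholds are all assumptions of mine, chosen precisely because the paper's own values are what the questions above are asking about.

```python
from statistics import mean


def admit(new_capacity, known_capacities, max_ratio=10.0, min_ratio=0.1):
    """Decide what to do with a node that wants to join.

    new_capacity     -- storage the joining node advertises (any unit)
    known_capacities -- capacities of nodes the admitting node knows about
    max_ratio, min_ratio -- hypothetical thresholds relative to the average

    Returns "split", "reject", or "accept".
    """
    average = mean(known_capacities)
    if new_capacity > max_ratio * average:
        # Store is too big: ask the node to join as several smaller nodes.
        return "split"
    if new_capacity < min_ratio * average:
        # Store is too small: refuse the node outright.
        return "reject"
    return "accept"


if __name__ == "__main__":
    existing = [100, 120, 80]  # made-up capacities, average is 100
    for capacity in (90, 5000, 1):
        print(capacity, admit(capacity, existing))
```

Even in this toy form, the open questions are visible: the result depends entirely on which capacities are averaged and on the two ratios, and nothing in the check adapts if many nodes start being rejected.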

Like many of the systems that we have examined, PAST hashes the filename with a salt value to determine a unique identifier for the file in the system. Hashing on the contents of a file may provide a better method for uniquely identifying files. The content of a file is often a better indicator of what the file represents than the arbitrary description in the file name. Hashing on the content may also allow the detection of duplicate files and allow for more efficient storage.
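
The difference between the two approaches can be shown with a small sketch. The function names, the use of SHA-1, and the way the filename and salt are concatenated are assumptions made for illustration; this is not PAST's actual identifier construction.

```python
import hashlib


def name_based_id(filename: str, salt: bytes) -> str:
    """Identifier derived from the file name and a salt (illustrative only)."""
    return hashlib.sha1(filename.encode("utf-8") + salt).hexdigest()


def content_based_id(content: bytes) -> str:
    """Identifier derived from the file's bytes alone."""
    return hashlib.sha1(content).hexdigest()


if __name__ == "__main__":
    data = b"the same bytes stored twice"

    # Two copies of the same data under different names get unrelated
    # name-based identifiers, so the duplication goes unnoticed.
    id_a = name_based_id("report.txt", b"salt-1")
    id_b = name_based_id("copy-of-report.txt", b"salt-2")
    print(id_a == id_b)  # False

    # Content-based identifiers collide exactly when the bytes match,
    # which is what would allow duplicate detection.
    print(content_based_id(data) == content_based_id(data))  # True
```

One trade-off worth noting is that a content-based identifier cannot be re-salted to steer a file toward a different part of the identifier space, so any mechanism that depends on choosing a new salt would need another way to relocate a file.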

For testing, the authors implemented their own "network emulation environment," but do not detail this decision. They should have stated why they did not use some preexisting software or at least given some proof as to the validity of the system they built. Since the measurements made examine the performance of storage management in PAST and not network performance, it may not matter how the network emulation environment performs, but if that is the case, it should be explicitly stated.

Figure 6 shows a scatter plot of file insertion failures versus utilization of the system's store. The scatter is fairly irregular for most of the graph, as would be expected in a properly working system. However, there is one file size near 1900000 bytes that, based on the graph, seems to have failed again and again at many utilization levels. This part of the graph stands out because it is so different from the rest, yet the authors never mention it. A brief explanation would be appropriate.

The main focus of the paper was on the storage management facilities of PAST. I found these sections of the paper to be uninteresting and difficult to read. Diagrams of how the diversion systems work in a hypothetical situation would make the concepts easier to see and understand.

Mainly for this final reason, I would not recommend anyone else read this paper.

REFERENCE
Anthony Rowstron and Peter Druschel. "Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility." Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), October 2001.