Scott Weber
CSE 498

Review of "Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design"


The Gnutella network is made up of arbitrary nodes from across the world, connected in an ad-hoc manner.  Prior to version 0.6 of the Gnutella protocol, there was no explicit control over the organization of the network.  Even with version 0.6, the minimal control asserted is optional and loosely enforced.  As such, the layout and connectedness of the network are indeterminate; to understand the shape and size of the network, it must be actively crawled.  This paper analyzes the results of several crawls of the Gnutella network and tries to extract properties of the connected system.

The authors built two crawlers.  The first was sequential and took so long to complete that the network being crawled at the end was likely unrelated to the network being crawled at the beginning.  The second version used a distributed client/server architecture to allow the crawl to complete much faster.  It is not clear whether the individual crawl clients were run on separate computers or not.  The authors' description of using up to 50 clients for a given crawl would seem to indicate that more than one crawler was running on a given machine.  It would be nice to know whether running the clients on separate machines would affect the quality of the crawl.  For example, having each crawler come from a different IP address may change the dynamics of the Gnutella network as many different nodes attempt to connect.  With multiple IP addresses, it also becomes more difficult to filter out those connections that are imposed by the crawler from those that pre-exist in the network.  This brings up an important question: does an intrusive crawl of the Gnutella network change the properties of the network enough to make the data obtained useless?  Perhaps simply by viewing the network we change it.  A crawler behaves differently from a typical Gnutella client: it tends to hold shorter sessions with any given node and to explore the other nodes on the network more fully.

The authors found that in November 2000, only about one third of the traffic on the network was useful user traffic, the rest being overhead from maintenance messages.  A later crawl found that after June 2001 the traffic problems seem to have been remedied, with 92% of the traffic devoted to useful user data.  This claim is not backed up with data and is inconsistent with an earlier statement that crawls were done only in November 2000, February/March 2001 and May 2001.  Nor do the authors attempt to find the cause of this drastic change in traffic patterns, simply saying that the problems were apparently "solved with the arrival of newer Gnutella implementations."  Many of the newer Gnutella clients support pong caching, so that rather than rebroadcasting a ping message, a node simply responds with the appropriate pong messages from its own local cache.  This scheme alone could be responsible for the vast improvement seen, as stopping a ping early would greatly reduce the number of messages sent.  As shown in Figure 3 in the paper, almost every host on the entire network can be reached from any point in seven or fewer hops.  Due to the broadcast nature of Gnutella, the number of messages generated by a single original message grows exponentially with the number of hops.  Stopping a message with TTL=7 after only one hop can reduce the total network traffic for that message by over 99%, provided each node forwards it to several neighbors.
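The 99% figure can be checked with a little arithmetic.  The following is a minimal sketch, not anything taken from the paper: it assumes an idealized overlay in which every node has the same number of connections, each node forwards a message to all neighbors except the one it came from, and no duplicate messages are suppressed.  It also ignores the pong replies that are routed back.  The function names and the uniform `neighbors` parameter are my own illustrative assumptions.

```python
def flood_messages(neighbors, ttl):
    """Count messages sent when one node floods a message for `ttl` hops.

    Assumption (not from the paper): every node has exactly `neighbors`
    connections and the overlay is tree-like, so no duplicates are dropped.
    """
    if neighbors < 1 or ttl < 1:
        return 0
    total = 0
    sent_this_hop = neighbors  # the originator sends to all of its neighbors
    for _ in range(ttl):
        total += sent_this_hop
        # each receiver forwards to every neighbor except the sender
        sent_this_hop *= neighbors - 1
    return total


def reduction_if_answered_at_first_hop(neighbors, ttl):
    """Fraction of flood traffic saved if first-hop nodes answer from a
    cache instead of forwarding the message any further."""
    full = flood_messages(neighbors, ttl)
    cached = flood_messages(neighbors, 1)
    return 1 - cached / full


if __name__ == "__main__":
    for n in (2, 3, 4):
        saved = reduction_if_answered_at_first_hop(n, 7)
        print(n, flood_messages(n, 7), round(100 * saved, 2))
```

Under these assumptions, four connections per node gives 4,372 messages for a full TTL=7 flood versus 4 when the ping is answered at the first hop.  With only two connections per node the saving drops to about 86%, so the 99% figure depends on nodes having at least three neighbors.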

Figures 3, 5 and 6 could be somewhat misleading.  Figure 3 has two distinct peaks, with some crawls lining up with one peak and the other crawls with the second peak.  The lines and peaks are not labelled, so no correlation can be made.  A label or description of which line represents which crawl would make this graph more valuable.  As for Figures 5 and 6, the many different lines plotted on a single graph with no labels make it difficult to determine how any individual crawl matches the power-law distribution.  While the overall average shows some power-law properties, it is almost impossible to tell whether a single given crawl matches at all.  The cluttered graph could be hiding a very differently shaped curve for a few or even many of the crawls.
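One way to make the per-crawl question concrete would be to fit each crawl separately and report a number rather than overlaying the curves.  The sketch below is an assumption of mine, not the authors' method: it does an ordinary least-squares fit of log(count) against log(degree), which is only a rough check and is known to be a crude estimator for power laws.  The function name and the input format (a mapping from node degree to number of nodes with that degree) are hypothetical.

```python
import math


def fit_power_law_exponent(degree_counts):
    """Fit count ~ C * degree**(-alpha) by least squares in log-log space.

    degree_counts: dict mapping a node degree (> 0) to the number of nodes
    with that degree (> 0).  Returns (alpha, r_squared).
    This is a rough check only, not a rigorous power-law test.
    """
    points = [(math.log(k), math.log(n))
              for k, n in degree_counts.items() if k > 0 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees to fit a line")
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    # r_squared near 1 means the points lie close to a straight line
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return -slope, r_squared
```

Run once per crawl, this would give one exponent and one goodness-of-fit value for each line in Figures 5 and 6, so a crawl that deviates from the power law would stand out instead of being buried in the clutter.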

Two comparisons are used to try to determine how well the Gnutella network correlates to the underlying Internet infrastructure: autonomous system numbers and domain names.  The autonomous system indicator is probably a good one, as AS numbers tend to cover larger networks that are closely connected.  The domain name comparison may not be quite as powerful.  While some organizations, such as universities and medium-sized companies, probably have the desired proximity properties, many other domains do not.  Consider a large company that uses the same domain for its offices in New York City, Los Angeles and London, or a hosting company that has hundreds or thousands of domains within its local network.  Domain names are probably not as good an indicator of the underlying Internet infrastructure as are AS numbers.

This paper provides some interesting statistics and results pertaining to the properties of the Gnutella network.  It is well-written and easy to read, but lacks detail in a few places.  Due to the dynamic nature of the network, generalizations must be made.  However, parts of this paper seem to over-generalize, drawing conclusions that are not supported by the narrower data presented.  I would recommend this paper to anyone trying to analyze or crawl the Gnutella network.

REFERENCE
Matei Ripeanu, Ian Foster, and Adriana Iamnitchi. "Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design." IEEE Internet Computing, 6(1), 2002.