David Manura
CSE-498 (Adv. Networks)
2003-04-15

A Protocol-Independent Technique for Eliminating Redundant Network Traffic--A Review

The authors [SW02] expose an important, fundamental issue, but they fall short in addressing it. The issue is this: data sent between network nodes contains redundant information. Redundancy exists through correlation in the contents of distinct data packets across time and space, and exploiting this correlation could reduce network bandwidth utilization. Web proxy caches address this issue in part, but their unit of caching granularity is coarse, similarity between and within content resources goes unexploited, and traditional proxies are tied to a specific application-layer protocol. Data compression also addresses this issue, but distributed and dynamic systems, such as networks, have unique needs.

There are three sides to this paper, each of which could be a separate research topic. The first is the existence and nature of redundancy in network traffic. A researcher could examine a wide range of network traffic to identify the degree to which, and the areas in which, network traffic is redundant, and then determine the reasons for this redundancy. For example, why is there a distinction between redundancy in incoming and outgoing traffic? Such an analysis could follow the style of Paxson [PF95] [Pa96] [Pa99] and of the related investigations into network self-similarity [CB96]. Second, building on this diagnosis, various algorithms, services, and protocol changes could be proposed with theoretical backing, and architectural guidelines could be developed for handling the redundancy problem across application-layer protocols. Third, these proposals could be implemented and evaluated.

Except to some degree for the last topic, this paper seems to fold all of these ideas into one but executes none of them remarkably. There is some confusion about the main purpose of the paper. The title implies the paper is about a technique (issue two), but the first paragraph of Section 3 indicates that the focus is on diagnosis (issue one), not implementation (issue three); "architecture" (typically considered a mix of issues two and three) is then mentioned ambiguously. The title says the paper is about a technique for "eliminating" redundant network traffic (issue two), yet the first sentence of the abstract says it is about a technique for "identifying" repetitive information (issue one). The distinction is further muddied by a discussion of the design, implementation, and performance evaluation of an implementation for issue one (not three), which follows, somewhat out of order, a solution to issue two.

How is redundancy formally defined in a distributed, dynamic system? Redundancy is described in quantitative terms throughout the paper, which left me puzzled. Is redundancy inversely related to entropy, or can data be redundant yet still have low entropy? Does redundancy depend on the semantics of the data? Can redundancy exist in the timing of packet transmission as well as in the values of packet bits? Both carry information content. In addition, I suspect that an optimal algorithm for eliminating redundancy (not that one is necessary) would be computationally infeasible, more so for a dynamic system.
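To make the entropy question concrete, a small sketch (my own construction, not from the paper): a high-entropy block repeated many times has high per-byte symbol entropy yet is almost entirely redundant, so order-0 entropy alone cannot be the inverse of the redundancy the paper is after.

```python
import collections
import math
import os
import zlib

def order0_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, treating bytes as i.i.d. symbols."""
    counts = collections.Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

block = os.urandom(64)    # one block of high-entropy (random) bytes
stream = block * 100      # ...repeated 100 times: highly redundant

# Per-byte symbol entropy stays high (near that of random bytes)...
print(order0_entropy(stream))
# ...yet a compressor removes almost everything: the stream is redundant.
print(len(zlib.compress(stream)) / len(stream))
```

The repetition is invisible to a symbol-frequency measure but obvious to any scheme (like the paper's) that looks for repeated substrings, which suggests redundancy here is closer to compressibility than to order-0 entropy.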

Can the elimination of redundancy at the network layer weaken security systems at the application layer? First, packets are being cached, possibly at intermediate locations. Second, an adversary could discover the presence of a given packet in the proxy cache by sending an equivalent packet through that proxy and measuring the latency. The specifics of how this peculiarity could be exploited are not well understood.
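The latency probe can be sketched with a toy model (the cache behavior, token size, and link cost below are all hypothetical assumptions of mine, not details from the paper):

```python
# Toy model of a redundancy-eliminating proxy: on a cache miss the full
# payload crosses the bottleneck link; on a hit only a short token does,
# so a hit completes measurably faster. All constants are hypothetical.
CACHE = set()
LINK_SECONDS_PER_BYTE = 1e-6   # assumed per-byte cost of the slow link
TOKEN_BYTES = 20               # assumed size of the cache-hit token

def proxy_transfer(payload: bytes) -> float:
    """Return the modeled transfer time; populate the cache as a side effect."""
    cost_bytes = TOKEN_BYTES if payload in CACHE else len(payload)
    CACHE.add(payload)
    return cost_bytes * LINK_SECONDS_PER_BYTE

victim_packet = b"A" * 1500      # payload a victim previously sent
proxy_transfer(victim_packet)    # the victim's traffic warms the cache

probe_miss = proxy_transfer(b"B" * 1500)    # never-seen payload: slow
probe_hit = proxy_transfer(victim_packet)   # replayed payload: fast
print(probe_hit < probe_miss)               # latency leaks cache state
```

The timing difference lets the prober infer that someone else recently sent that exact payload, without ever seeing the victim's traffic directly.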

Why cache at the packet (network or data-link) level and not at the transport level? Certainly, not all applications use TCP, but those that do typically have the concept of a continuous byte stream, which can be represented by multiple packet sequences, not all of which share the same signatures, owing to possible differences in packet sizes. A similar case can be made for web proxy caches versus transport-layer caching. Following the end-to-end argument, it seems that input from higher layers remains useful. In a similar light, compression (e.g. zlib) would be more efficient if done at the application level instead of at the packet level as described here, and packet-level compression of multimedia streams is redundant effort, since such streams are typically already compressed. How does the proposed technique work in conjunction with higher-level issues such as TCP handshaking, sequence numbers, and congestion control?
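The boundary mismatch can be made concrete with a toy sketch (the sizes below and the use of SHA-1 over fixed 64-byte windows are my assumptions; the paper itself uses Rabin fingerprints at content-determined positions):

```python
import hashlib

# One logical byte stream, packetized two ways, as two TCP connections
# with different MSS values might send it. Sizes are hypothetical.
stream = bytes(range(256)) * 40   # 10,240-byte stream

def packetize(data: bytes, size: int) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]

def packet_digests(packets) -> set:
    """Whole-packet fingerprints, as a naive packet-level cache might compute."""
    return {hashlib.sha1(p).digest() for p in packets}

def window_digests(packets, w: int = 64) -> set:
    """Fingerprints of every w-byte window inside each packet; SHA-1 over
    fixed-size windows stands in here for Rabin fingerprints."""
    out = set()
    for p in packets:
        for i in range(len(p) - w + 1):
            out.add(hashlib.sha1(p[i:i + w]).digest())
    return out

a, b = packetize(stream, 1460), packetize(stream, 1400)

# Whole-packet signatures from the two transfers never coincide, even
# though the underlying bytes are identical: the boundaries differ.
print(len(packet_digests(a) & packet_digests(b)))   # 0

# Content-window fingerprints overlap heavily, recovering the redundancy.
print(len(window_digests(a) & window_digests(b)) > 0)   # True
```

This is why sub-packet, content-based fingerprinting can catch redundancy that whole-packet or whole-object caching misses, though reassembling the byte stream at the transport layer would sidestep the boundary problem entirely.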

The stated purpose of the proposed technique is bandwidth reduction. Would a complete implementation indeed reduce bandwidth, given the need for synchronization mechanisms? Are these same ideas useful elsewhere, such as in symmetric multiprocessing (SMP) systems?

The paper is useful at least for the questions it raises. Data redundancy reduction and packet caching in network communication are unique ideas, but much remains to be clarified and demonstrated.

[SW02] Neil T. Spring and David Wetherall. "A Protocol-Independent Technique for Eliminating Redundant Network Traffic." Proceedings of ACM SIGCOMM, 2000.

[PF95] Vern Paxson and Sally Floyd. "Wide Area Traffic: The Failure of Poisson Modeling." IEEE/ACM Transactions on Networking, 3(3), June 1995.

[Pa96] Vern Paxson. "End-to-End Routing Behavior in the Internet." IEEE/ACM Transactions on Networking, 5(5):601-615, October 1997. (Earlier version in Proc. of ACM SIGCOMM, 1996.)

[Pa99] Vern Paxson. "End-to-End Internet Packet Dynamics." IEEE/ACM Transactions on Networking, 7(3):277-292, June 1999. (Earlier version in Proc. of ACM SIGCOMM, 1997.)

[CB96] Mark Crovella and Azer Bestavros. "Self-Similarity in WWW Traffic: Evidence and Possible Causes." IEEE/ACM Transactions on Networking, December 1997. (Earlier version in Proc. of ACM SIGMETRICS, 1996.)