David Manura
CSE-498 (Adv. Networks)
2003-03-27

Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes - A Review

The paper [1] measures the prevalence of self-similarity in web traffic and attempts to explain its causes. Four mathematical tests are applied, and all of them find that the degree of self-similarity (i.e. the Hurst parameter, H) is approximately within the range 0.7-0.8 during a single busy hour. The authors then argue that this self-similarity can be explained by a large number of concurrent ON-OFF processes (file transfers) having heavy-tailed ON and OFF durations. The heavy-tailed character dominates in the ON process. Further, browser caches cause a significant correspondence between the set of file transfers and the set of available (at least once requested) files. Moreover, the set of such available files shows a file size distribution similar to that of typical sets of hosted files, which are known to be heavy-tailed. Hence, the heavy-tailed character of the set of hosted files tends to dominate the heavy-tailed character of the set of file transfers. The OFF process is argued to consist of two distinct processes: a high-frequency, machine-induced effect corresponding to requests for multiple files composing a single document, and a low-frequency effect corresponding to user responses.
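The ON-OFF mechanism summarized above can be illustrated with a small simulation. The following is only a minimal sketch under my own assumptions, not the authors' procedure: the function names, the time-slot discretization, the number of sources, and the Pareto shape parameters (1.2 for ON, 1.5 for OFF) are illustrative choices and are not figures taken from [1].

```python
import random


def on_off_source(slots, alpha_on, alpha_off, rng):
    """Return a list of 0/1 flags, one per time slot (1 = source is ON)."""
    activity = []
    on = rng.random() < 0.5  # assumed: random initial state
    while len(activity) < slots:
        alpha = alpha_on if on else alpha_off
        # paretovariate() returns values >= 1 with a heavy (power-law) tail.
        # Cap the duration so one extreme draw cannot exhaust memory.
        duration = min(int(rng.paretovariate(alpha)), slots - len(activity))
        activity.extend([1 if on else 0] * duration)
        on = not on
    return activity


def aggregate_traffic(sources, slots, alpha_on=1.2, alpha_off=1.5, seed=0):
    """Sum many independent ON-OFF sources into one traffic series."""
    rng = random.Random(seed)
    total = [0] * slots
    for _ in range(sources):
        for t, flag in enumerate(on_off_source(slots, alpha_on, alpha_off, rng)):
            total[t] += flag
    return total


if __name__ == "__main__":
    traffic = aggregate_traffic(sources=50, slots=2000)
    print(traffic[:20])
```

Each value in the resulting series is the number of sources transmitting in that slot. Feeding such a series to a Hurst estimator is one way to check informally whether heavy-tailed durations alone produce burstiness across time scales.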

Although I believe the paper is fairly solid in its overall argument and execution, the utility of this research is weakly argued. We are told in the introduction that a proper understanding of network traffic is critical to the proper design and implementation of the web and that others have found it useful to evaluate the prevalence of self-similarity in other types of network traffic. I'm uncertain about the relative importance of self-similarity as a factor in modeling the behavior of web traffic. In modeling the web, we might, for example, consider a certain dominant low-frequency burstiness (e.g. month of year on a college campus) and safely ignore all higher frequencies. Above all, in what cases is web modeling inaccurate or wrong because self-similarity effects are neglected? In any case, the paper seems to suggest that a proper heavy-tailed distribution of document sizes is crucial, possibly more important than changes in user behavior, when simulating the web.

Overall, the argument is clearly presented for a general reader: it is neither overly rigorous nor lacking in its general argument. The exceptions, which are not immediately clear from the paper itself, are the definition, or at least the significance, of the rescaled range (R/S) [2] [3] and the power spectrum [4]. Although these are likely described fully in the authors' citations, a brief mention of them would improve the flow.
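For readers unfamiliar with the rescaled range, the idea can be sketched briefly. For a block of n samples, R is the range of the cumulative deviations from the block mean and S is the block's standard deviation; the slope of log(R/S) against log(n) gives an estimate of H. The code below is a minimal sketch of this textbook procedure under my own assumptions (block sizes doubling from 8, a plain least-squares fit, and hypothetical function names); it is not the exact method used in [1].

```python
import math
import random


def rescaled_range(block):
    """Return R/S for one block, or None if the block is constant."""
    n = len(block)
    mean = sum(block) / n
    cumulative = 0.0
    lowest = highest = 0.0
    for value in block:
        cumulative += value - mean  # running sum of deviations from the mean
        lowest = min(lowest, cumulative)
        highest = max(highest, cumulative)
    std = math.sqrt(sum((v - mean) ** 2 for v in block) / n)
    return (highest - lowest) / std if std > 0 else None


def estimate_hurst(series, min_block=8):
    """Estimate H as the least-squares slope of log(R/S) versus log(n)."""
    xs, ys = [], []
    n = min_block
    while n <= len(series) // 2:
        values = []
        for start in range(0, len(series) - n + 1, n):
            rs = rescaled_range(series[start:start + n])
            if rs is not None:
                values.append(rs)
        if values:
            xs.append(math.log(n))
            ys.append(math.log(sum(values) / len(values)))
        n *= 2  # assumed: block sizes double at each step
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


if __name__ == "__main__":
    rng = random.Random(1)
    noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
    print(estimate_hurst(noise))
```

Uncorrelated noise should yield an estimate near 0.5 (R/S is known to be biased somewhat upward on short series), whereas values in the 0.7-0.8 range reported by the paper indicate long-range dependence.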

The results seem to be self-consistent (the authors apply four different statistical tests) and consistent with research by others. I believe that this argument was well supported.

The paper attempts to find causes of self-similarity, but the restriction to a one-hour sampling period could obscure factors that act on longer time scales.

Network congestion and the distribution of bandwidth, packet loss rates, and round-trip times to remote servers are potentially important determinants in the behavior of network traffic. Are these heavy-tailed?

The authors indicate that such a study would be harder today due to the increased prevalence of commercial browsers without available source code. Similar tests could be done at a proxy server if browser caches are disabled on local computers on a managed network. Alternatively, each computer could run a local proxy, which would maintain much of the original caching behavior. Netscape Navigator is of course no longer closed source, so a non-Windows lab could still be studied.

Some minor points follow. Graphs could have better captions. Section 4.3, paragraph 3, last sentence is missing a data value.

Overall, the paper presents effective evidence and justification for self-similarity.

[1] Mark E. Crovella and Azer Bestavros. Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes. IEEE/ACM Transactions on Networking. December 1997. Original version in Proc. of ACM SIGMETRICS, 1996. http://citeseer.nj.nec.com/crovella96selfsimilarity.html

[2] P. Vanouplines. Rescaled Range Analysis and the Dimensions of pi. 1995. http://gopher.ulb.ac.be/~pvouplin/pi/rswhat.htm

[3] Glossary of Terms Used in Time Series Analysis of Cardiovascular Data. 1999. http://www.cbi.polimi.it/glossary/RescaledRange.html

[4] Shahrokh D. Yadegari. Self-Similar Synthesis on the Border Between Sound and Music. Master's Thesis. MIT. August 1992. http://crca.ucsd.edu/~syadegar/MasterThesis/node41.html