Detecting Phrase-level Duplication on the World Wide Web
Review by Baoning Wu

This paper describes the methods the authors use to detect a particular kind of spam: pages generated by stitching together sentences drawn from a repository. The work is important because duplicate spam, such as clones of Wikipedia or DMOZ, is increasingly common. The authors go a step further than prior work: their algorithm detects sentence-level and phrase-level duplication in addition to whole-page duplication. The method is reasonable, two large real-world data sets are used in the experiments, and the results show that a great deal of duplicate content exists on the Web today.

Some weaknesses of this work:

1. The original goal of the work is to detect spam pages, but the results require human judgment to decide whether a given duplication is legitimate. With billions of pages on the Web today, such manual evaluation is not viable.

2. The future work section should offer some ideas for addressing the main disadvantage of the method: how to distinguish legitimate copying from spamming behavior.

3. Several parameters are mentioned at the end of Section 3.2, such as the use of five-word phrases. It would be worth showing the performance for different values in order to find the optimal settings for this technique.

4. Since the goal is to detect copying of well-written sentences, the preprocessing of pages could be improved, for example by removing JavaScript and keeping periods as markers of sentence boundaries. I am also not sure why treating the document as a circle (mentioned in Section 3.2) is helpful; a sketch of my reading of the phrase extraction appears after this list.

5. For candidate selection, the authors choose web sites with a high page count and a small standard deviation. This choice needs justification: web sites with fewer pages may still contain copied content.
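
Regarding items 3 and 4, my understanding of the paper's phrase extraction is a shingling-style pass over five-word phrases, with "treating the document as a circle" meaning that phrases wrap around from the end of the token list back to the beginning. The sketch below is my own illustration under that assumption, not the authors' code; the function name and example strings are hypothetical.

```python
# Minimal sketch of circular five-word phrase extraction, assuming "circle"
# means phrases wrap around the end of the word list. Illustration only.

def circular_phrases(text, k=5):
    """Return the set of k-word phrases in text, wrapping around the end."""
    words = text.split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    n = len(words)
    # One phrase starts at every position; the last k-1 phrases wrap around.
    return {tuple(words[(i + j) % n] for j in range(k)) for i in range(n)}

# Example: shared five-word phrases between two pages hint at copied sentences.
a = circular_phrases("the quick brown fox jumps over the lazy dog")
b = circular_phrases("a quick brown fox jumps over a sleeping dog")
print(len(a & b))  # number of five-word phrases in common
```

With this reading, the wrap-around only adds a handful of artificial phrases spanning the document's end and beginning, which is why I question how much it helps; an experiment comparing the circular and non-circular variants would clarify the point.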