Detecting Phrase-level Duplication on the World Wide Web
Review by Baoning Wu

This paper describes the methods the authors use to detect a particular kind of spam: pages generated by stitching together sentences drawn from a repository. The work is important because duplicate spam, such as clones of Wikipedia or DMOZ, is increasingly common. The authors go a step further than prior work: their algorithm detects sentence-level and phrase-level duplication in addition to whole-page duplication. The method is reasonable, two large real-world data sets are used in the experiments, and the results show that a great deal of duplicate content exists on the Web today.

Some weaknesses of this work:

1. The original goal of the work is to detect spam pages, but the results require human judgment to decide whether a given duplication is legitimate. With billions of pages on the Web today, such manual evaluation is not viable.

2. The future work section should offer some ideas for addressing the main disadvantage of the method: how to distinguish legitimate copying from spamming behavior.

3. Several parameters are mentioned at the end of Section 3.2, such as the use of five-word phrases. It would be worth showing the performance for different values in order to find the optimal settings for this technique.

4. Since the goal is to detect copying of well-written sentences, the preprocessing of pages could be improved, for example by removing JavaScript and keeping periods as markers of sentence boundaries. I am also not sure why treating the document as a circle (mentioned in Section 3.2) is helpful; a sketch of my reading of the phrase extraction appears after this list.

5. For candidate selection, the authors choose web sites with a high page count and a small standard deviation. This choice needs justification: web sites with fewer pages may still contain copied content.
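
Regarding items 3 and 4, my understanding of the paper's phrase extraction is a shingling-style pass over five-word phrases, with "treating the document as a circle" meaning that phrases wrap around from the end of the token list back to the beginning. The sketch below is my own illustration under that assumption, not the authors' code; the function name and example strings are hypothetical.

```python
# Minimal sketch of circular five-word phrase extraction, assuming "circle"
# means phrases wrap around the end of the word list. Illustration only.

def circular_phrases(text, k=5):
    """Return the set of k-word phrases in text, wrapping around the end."""
    words = text.split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    n = len(words)
    # One phrase starts at every position; the last k-1 phrases wrap around.
    return {tuple(words[(i + j) % n] for j in range(k)) for i in range(n)}

# Example: shared five-word phrases between two pages hint at copied sentences.
a = circular_phrases("the quick brown fox jumps over the lazy dog")
b = circular_phrases("a quick brown fox jumps over a sleeping dog")
print(len(a & b))  # number of five-word phrases in common
```

With this reading, the wrap-around only adds a handful of artificial phrases spanning the document's end and beginning, which is why I question how much it helps; an experiment comparing the circular and non-circular variants would clarify the point.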