Webis Chatnoir-Copycat 2021

Synopsis

The Webis Chatnoir-Copycat 2021 dataset contains information on near-duplicate documents, detected automatically with SimHash, within the ClueWeb09, the ClueWeb12, and two snapshots of the Common Crawl, as well as between selected pairs of these corpora. The dataset addresses several issues:

Access

To cite the dataset, please refer to the publications.

People