Webis Chatnoir-Copycat 2021

Synopsis

The Webis Chatnoir-Copycat 2021 dataset contains information on near-duplicate documents, detected automatically with SimHash, within ClueWeb09, ClueWeb12, and two snapshots of the Common Crawl, as well as between selected pairs of these corpora.
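
For readers unfamiliar with SimHash, the following is a minimal sketch of the general technique, not the pipeline used to build this dataset. The whitespace tokenization, the MD5-based token hash, the 64-bit fingerprint size, and the Hamming-distance threshold of 3 are all illustrative assumptions, as are the function names.

```python
import hashlib

BITS = 64  # fingerprint width (assumption for this sketch)


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint over whitespace-separated tokens."""
    votes = [0] * BITS
    for token in text.lower().split():
        # Stable 64-bit token hash; MD5 is used only for determinism here.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when its position received a positive vote total.
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(doc_a: str, doc_b: str, max_distance: int = 3) -> bool:
    """Treat two documents as near-duplicates if their fingerprints are close.

    The threshold of 3 bits is a hypothetical default, not a value taken
    from the dataset.
    """
    return hamming_distance(simhash(doc_a), simhash(doc_b)) <= max_distance
```

Because similar token sets produce fingerprints that differ in only a few bits, candidate near-duplicates can be found by comparing short fingerprints instead of full document texts.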

Access

To cite the dataset, please refer to this publication.

People

Publications