Webis Chatnoir-Copycat 2021
Synopsis
The Webis Chatnoir-Copycat 2021 dataset contains information on near-duplicate documents, detected automatically with SimHash, within the ClueWeb09, the ClueWeb12, and two snapshots of the Common Crawl, as well as between selected pairs of these corpora.
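For readers unfamiliar with SimHash, the following is a minimal sketch of how a SimHash fingerprint can be computed and compared. It is not the pipeline used to build this dataset: the word-level tokenization, the MD5-based token hash, the 64-bit fingerprint width, and the Hamming-distance threshold of 3 are all illustrative assumptions, and the function names are hypothetical.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over lowercase word tokens.

    Assumptions (not taken from the dataset description): unweighted
    word tokens as features, MD5 as the per-token hash, 64-bit output.
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1

    # A bit of the fingerprint is set where the positive votes win.
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a: str, text_b: str, max_distance: int = 3) -> bool:
    """Treat two texts as near-duplicates if their fingerprints are close.

    The threshold of 3 bits is an illustrative assumption.
    """
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

Because similar documents share most of their tokens, most bit votes agree and the fingerprints differ in only a few positions, so near-duplicate candidates can be found by comparing compact fingerprints instead of full texts.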
Access
To cite the dataset, please refer to the associated publication.
People
Publications