Webis Chatnoir-Copycat 2021
Synopsis
The Webis Chatnoir-Copycat 2021 dataset contains information on near-duplicate documents, detected automatically with SimHash, within the ClueWeb09, the ClueWeb12, and two snapshots of the Common Crawl, as well as between selected pairs of these corpora. The dataset addresses several issues:
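As background on the SimHash technique named in the synopsis, the following is a minimal sketch of how a SimHash fingerprint can be computed and compared. It is not the pipeline used to build the dataset: the whitespace tokenization, the MD5-based token hash, the 64-bit fingerprint width, and the function names `simhash` and `hamming_distance` are all assumptions made for illustration.

```python
import hashlib

BITS = 64  # fingerprint width; an assumption for this sketch


def simhash(text: str) -> int:
    """Compute a basic SimHash fingerprint over whitespace-separated tokens."""
    counts = [0] * BITS
    for token in text.lower().split():
        # Map each token to a stable 64-bit integer.
        # MD5 is an arbitrary choice here; any stable hash would do.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set where the positive votes outweigh the negative ones.
    fingerprint = 0
    for i in range(BITS):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents are then treated as near-duplicate candidates when the Hamming distance between their fingerprints is small. The appropriate threshold depends on the features and fingerprint width in use and is not specified here.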
Access
To cite the dataset, please refer to the publications.
People
- Maik Fröbe
- Janek Bevendorff
- Lukas Gienapp
- Benno Stein
- Martin Potthast
- Matthias Hagen