Synopsis

The Webis Chatnoir-Copycat 2021 dataset contains information on near-duplicate documents that were detected automatically with SimHash within ClueWeb09, ClueWeb12, and two snapshots of the Common Crawl, as well as between selected pairs of these corpora. The dataset serves several purposes:

  • Make it as simple as possible for users to obtain a copy of their crawl without near-duplicates
  • Support the transition between standard evaluation corpora (such as from ClueWeb09 to ClueWeb12) for web search tasks by enabling the transfer of annotated information to near-duplicates in a target crawl
  • Enable studies on near-duplicates in static web corpora

Download

The datasets are stored in our Ceph cluster and can be accessed through our public S3 endpoints. S3 allows direct download over HTTP and can be used directly from big data tools like Hadoop and Spark.

Exclusion Lists (IDs to Remove)

The following files contain the IDs of documents that you can skip during the processing of your corpus to obtain a corpus without near-duplicates. Each document specified by an ID has a near-duplicate that remains in the crawl.

The data contains one ID per line and is split into multiple bzip2-compressed parts. To retrieve the first two IDs to remove for ClueWeb09, run:

curl 'https://corpus-copycat.s3.data.webis.de/documents-to-remove/cw09-ids-to-remove-bzip2/part-00000.bz2' \
2>/dev/null |\
bzip2 -dc \
|head -2
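
Once the parts are downloaded, applying an exclusion list amounts to a set lookup per document. The following is a minimal sketch, assuming the part files are stored locally and that your corpus reader yields (doc_id, content) tuples; the helper names load_ids and without_near_duplicates are hypothetical and not part of the dataset tooling.

```python
import bz2
from typing import Iterable, Iterator, Set, Tuple


def load_ids(paths: Iterable[str]) -> Set[str]:
    """Read one ID per line from bzip2-compressed part files into a set."""
    ids: Set[str] = set()
    for path in paths:
        # Mode "rt" decompresses on the fly and yields text lines.
        with bz2.open(path, "rt", encoding="utf-8") as lines:
            ids.update(line.strip() for line in lines if line.strip())
    return ids


def without_near_duplicates(
    docs: Iterable[Tuple[str, str]], ids_to_remove: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Yield only the (doc_id, content) pairs whose ID is not excluded."""
    for doc_id, content in docs:
        if doc_id not in ids_to_remove:
            yield doc_id, content
```

For large crawls the full set of IDs may not fit comfortably in memory; in that case, loading only the parts relevant to the slice of the corpus being processed, or doing the join in Hadoop or Spark, is likely the better choice.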

Inclusion Lists

The following files contain all IDs of documents that you can process to obtain a corpus without near-duplicates. No two documents on an inclusion list are near-duplicates of each other, so every document specified in the list should remain in the crawl.

The data contains one ID per line and is split into multiple bzip2-compressed parts. To retrieve the first two IDs from the inclusion list for ClueWeb09, run:

curl 'https://corpus-copycat.s3.data.webis.de/near-duplicate-free-inclusion-lists/cw09b/part-00000.bz2' \
2>/dev/null |\
bzip2 -dc \
|head -2
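
The same peek can be done from Python without decompressing a whole part. This is a minimal sketch, assuming any readable binary stream of a bzip2-compressed part; the function name first_ids is hypothetical.

```python
import bz2
from itertools import islice
from typing import BinaryIO, List


def first_ids(stream: BinaryIO, n: int = 2) -> List[str]:
    """Return the first n IDs from a bzip2-compressed binary stream."""
    # bz2.open accepts an existing file object and decompresses lazily,
    # so only as much input is read as the first n lines require.
    with bz2.open(stream, "rt", encoding="utf-8") as lines:
        return [line.rstrip("\n") for line in islice(lines, n)]
```

A local file opened in binary mode can be passed directly. The response object returned by urllib.request.urlopen for the part URL above is also a readable binary stream, so it should work the same way.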

Pairs of Near-Duplicates

To make our work reproducible, we make the intermediate results of the final step of our pipeline (near-duplicates) available. Please note that the pairs alone are not sufficient: you also need to take the groups of near-duplicates (documents with identical SimHash) into consideration.

The data comes in the CSV format first-id,second-id, where first-id is a near-duplicate of second-id and first-id is the alphanumerically lower ID. The pairs of IDs are distinct (without duplicates), and the reverse pair second-id,first-id is omitted. To retrieve the first two pairs of near-duplicates between ClueWeb09 and ClueWeb12, run:

curl 'https://corpus-copycat.s3.data.webis.de/near-duplicates/cw09-cw12/part-00000' 2>/dev/null |\
head -2
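
Because each pair is listed only once, a lookup of "all near-duplicates of document X" has to consider both columns. The following is a minimal sketch of such a symmetric index, assuming the lines of a downloaded part are available as an iterable of strings; the function name build_near_duplicate_index is hypothetical.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_near_duplicate_index(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map every ID to the IDs it is paired with, in both directions."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        first_id, second_id = row
        # The file lists each pair once, so add the reverse direction here.
        index[first_id].add(second_id)
        index[second_id].add(first_id)
    # Return a plain dict so that lookups of unknown IDs do not create entries.
    return dict(index)
```

With a local part file, this could be called as build_near_duplicate_index(open("part-00000", newline="")), and index.get(doc_id, set()) then returns the paired IDs for a document.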

Groups of Near-Duplicates

To make our work reproducible, we make the intermediate results of step two of our pipeline available: documents with identical SimHash.

The data comes in the JSONL format (i.e., each line is a valid JSON document that represents one group of documents) and is split into multiple parts. To retrieve the first two groups of documents with identical fingerprints for ClueWeb09, run:

curl 'https://corpus-copycat.s3.data.webis.de/exact-duplicates/cw09/part-00000' \
2>/dev/null |\
head -2
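
The field names of a group record are not listed here, so a first step when working with these files is to inspect their structure. The following is a minimal sketch that counts the top-level keys of the first records of any JSONL part; the function name top_level_keys and the default limit are hypothetical choices.

```python
import json
from collections import Counter
from itertools import islice
from typing import Iterable


def top_level_keys(lines: Iterable[str], limit: int = 1000) -> Counter:
    """Count how often each top-level key occurs in the first `limit` lines."""
    counts: Counter = Counter()
    for line in islice(lines, limit):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)  # every non-empty line is one JSON document
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts
```

The same helper applies to the other JSONL files of the dataset.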

Document Representations

To make our work reproducible, we publish the document representations that are the result of the first step of our pipeline. Each raw document is mapped to a representation with the fields docId, url, canonicalURL, 64BitK3SimHashOneGrams, and 64BitK3SimHashThreeAndFiveGrams. We use these representations for near-duplicate detection with SimHash in the subsequent steps of the pipeline.

The data comes in the JSONL format (i.e., each line is a valid JSON document that represents one document) and is split into multiple bzip2-compressed parts. To retrieve the first two document representations for ClueWeb09, run:

curl 'https://corpus-copycat.s3.data.webis.de/document-representations/cw09/part-00000.bz2' \
2>/dev/null |\
bzip2 -dc \
|head -2
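
To illustrate how these representations feed the later steps, the following minimal sketch groups documents that share an identical fingerprint. It rests on several assumptions: the listed fields are top-level keys of each JSON line, the lines are already decompressed, and 64BitK3SimHashOneGrams is the fingerprint used for grouping (the other fingerprint field can be passed instead). The function name group_by_fingerprint is hypothetical, and this is not the pipeline's actual implementation.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_fingerprint(
    lines: Iterable[str], field: str = "64BitK3SimHashOneGrams"
) -> Dict[str, List[str]]:
    """Group docIds by identical fingerprint; keep groups with 2+ members."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        doc = json.loads(line)
        # Serialize the fingerprint so that it can serve as a dictionary key
        # no matter whether it is stored as a number, a string, or a list.
        key = json.dumps(doc[field], sort_keys=True)
        groups[key].append(doc["docId"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Documents that end up in no group have a fingerprint that is unique within the processed lines.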

Transferred Relevance Judgments

To make our work reproducible, we make the transferred relevance judgments of the Web and Session tracks available. If a judged document has multiple near-duplicates, we list only one of them for each transferred relevance judgment.

The data comes in the JSONL format. To retrieve the first two transferred relevance judgments, run:

curl 'https://corpus-copycat.s3.data.webis.de/relevance-transfer.jsonl' \
2>/dev/null |\
head -2

People