Webis-Wikipedia-Text-Reuse-18
Synopsis
The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl.
Download
You can access the "Within Wikipedia Text Reuse" corpus on Zenodo.
- wikipedia.jsonl.bz2 (3.0 GB)
  - Each line represents a Wikipedia article and contains a JSON array of article_id, article_title, and article_body.
- within-wikipedia-tr-01.jsonl.bz2 (3.0 GB)
  - Each line represents a text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text).
- within-wikipedia-tr-02.jsonl.bz2 (2.5 GB)
  - Each line represents a text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text).
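The text reuse files can be read line by line without decompressing them to disk first. The following is a minimal sketch in Python using only the standard library. The helper names `as_record` and `iter_reuse_cases` are hypothetical, and the sketch assumes the field order given above (s_id, t_id, s_text, t_text). Because the exact JSON layout of each line is not verified here, the helper accepts either a JSON array in that order or a JSON object keyed by those names.

```python
import bz2
import json

# Field order as described in the file listing (assumed, not verified).
REUSE_FIELDS = ("s_id", "t_id", "s_text", "t_text")


def as_record(value, fields=REUSE_FIELDS):
    """Normalize one parsed line into a dict keyed by field name.

    Accepts a JSON object (dict) or a JSON array (list) in field order.
    """
    if isinstance(value, dict):
        return {name: value.get(name) for name in fields}
    return dict(zip(fields, value))


def iter_reuse_cases(path):
    """Stream text reuse cases from a .jsonl.bz2 file, one dict per line."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield as_record(json.loads(line))


if __name__ == "__main__":
    # Print the ids of the first three cases from a locally downloaded file.
    for i, case in enumerate(iter_reuse_cases("within-wikipedia-tr-01.jsonl.bz2")):
        print(case["s_id"], case["t_id"])
        if i == 2:
            break
```

Streaming keeps memory use flat regardless of file size, which matters for multi-gigabyte archives like these.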
You can access the "Commoncrawl & Wikipedia Text Reuse" corpus on Zenodo.
- preprocessed_web_sample.jsonl.xz (34.9 GB)
  - Each line represents a web page and contains a JSON object of page_id, page_url, and content.
- without-wikipedia-tr.jsonl.bz2 (190.5 MB)
  - Each line represents a text reuse case and contains a JSON array of s_id (Wikipedia page id), t_id (web page id), s_text (source text), and t_text (target text).
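Since t_id in the reuse file refers to a web page id, the reuse cases can be linked back to the URLs in the web sample. The sketch below is one possible way to do that in Python with the standard library. The function names `iter_jsonl`, `field`, and `target_page_urls` are hypothetical, and the sketch assumes that t_id values correspond to page_id values and that array-style lines follow the field order listed above. Ids are compared as strings in case the two files encode them differently.

```python
import bz2
import json
import lzma


def iter_jsonl(path):
    """Stream parsed JSON lines from a .xz or .bz2 compressed file."""
    opener = lzma.open if path.endswith(".xz") else bz2.open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def field(record, name, position):
    """Read a field from a JSON object by name, or from a JSON array by position."""
    if isinstance(record, dict):
        return record[name]
    return record[position]


def target_page_urls(reuse_path, pages_path):
    """Map each web page id that appears as a reuse target to its URL."""
    # The reuse file is small (about 190 MB compressed), so its ids fit in memory.
    wanted = {str(field(case, "t_id", 1)) for case in iter_jsonl(reuse_path)}
    urls = {}
    # The web sample is large, so it is streamed once and never held in memory.
    for page in iter_jsonl(pages_path):
        page_id = str(field(page, "page_id", 0))
        if page_id in wanted:
            urls[page_id] = field(page, "page_url", 1)
    return urls


if __name__ == "__main__":
    urls = target_page_urls(
        "without-wikipedia-tr.jsonl.bz2", "preprocessed_web_sample.jsonl.xz"
    )
    print(len(urls), "target pages resolved")
```

Loading the smaller file first and streaming the larger one avoids holding the 34.9 GB web sample in memory.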