Webis-Wikipedia-Text-Reuse-18

Synopsis

The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl.

Download

You can access the "Within Wikipedia Text Reuse" corpus on Zenodo.

  • wikipedia.jsonl.bz2 (3.0 GB)

    - Each line, representing a Wikipedia article, contains a JSON array of article_id, article_title, and article_body.

  • within-wikipedia-tr-01.jsonl.bz2 (3.0 GB)

    - Each line, representing a text reuse case, contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text).

  • within-wikipedia-tr-02.jsonl.bz2 (2.5 GB)

    - Each line, representing a text reuse case, contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text).
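Since each line of the reuse files is a JSON array, the corpus can be streamed directly from the compressed files with Python's standard library. A minimal sketch (the file path and the `read_reuse_cases` helper name are illustrative, not part of the corpus distribution):

```python
import bz2
import json

def read_reuse_cases(path):
    """Stream text reuse cases from a jsonl.bz2 file without
    decompressing it to disk. Each line is a JSON array of
    [s_id, t_id, s_text, t_text], as described above."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            s_id, t_id, s_text, t_text = json.loads(line)
            yield {"s_id": s_id, "t_id": t_id,
                   "s_text": s_text, "t_text": t_text}

# Example usage (path is a placeholder):
# for case in read_reuse_cases("within-wikipedia-tr-01.jsonl.bz2"):
#     print(case["s_id"], "->", case["t_id"])
```

The same pattern works for wikipedia.jsonl.bz2, whose per-line arrays hold article_id, article_title, and article_body instead.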

You can access the "Common Crawl & Wikipedia Text Reuse" corpus on Zenodo.

  • preprocessed_web_sample.jsonl.xz (34.9 GB)

    - Each line, representing a web page, contains a JSON object of page_id, page_url, and content.

  • without-wikipedia-tr.jsonl.bz2 (190.5 MB)

    - Each line, representing a text reuse case, contains a JSON array of s_id (Wikipedia page id), t_id (web page id), s_text (source text), and t_text (target text).
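The web sample is xz-compressed with one JSON object per line, so it can be streamed with Python's `lzma` module. A minimal sketch (the path and the `read_web_pages` helper name are illustrative):

```python
import lzma
import json

def read_web_pages(path):
    """Stream web pages from a jsonl.xz file. Each line holds a
    JSON object with page_id, page_url, and content, as described
    above."""
    with lzma.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example usage (path is a placeholder):
# for page in read_web_pages("preprocessed_web_sample.jsonl.xz"):
#     print(page["page_id"], page["page_url"])
```

Streaming line by line keeps memory use flat, which matters here given the 34.9 GB compressed size.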

Research

The datasets were extracted in the work of Alshomary et al. 2018, which aimed to study the text reuse phenomenon around Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.
