Webis-Wikipedia-Text-Reuse-18

Synopsis

The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl.

Download

Within-Wikipedia Text Reuse: To download the corpus, use the following links:

  • wikipedia.tar.gz (3.6 GB)

    - Each line represents one Wikipedia article and contains a JSON array of article_id, article_title, and article_body

  • within-wikipedia-tr-01.gz (4.4 GB)

    - Each line represents one text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text)

  • within-wikipedia-tr-02.gz (3.7 GB)

    - Each line represents one text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text)
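
The line format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes each line is a positional JSON array in the order s_id, t_id, s_text, t_text, and that the .gz files are plain gzip-compressed text. The names ReuseCase, parse_reuse_line, and iter_reuse_cases are hypothetical and not part of the corpus.

```python
import gzip
import json
from typing import Iterator, NamedTuple


class ReuseCase(NamedTuple):
    """One text reuse case (hypothetical helper type, not part of the corpus)."""
    s_id: str
    t_id: str
    s_text: str
    t_text: str


def parse_reuse_line(line: str) -> ReuseCase:
    # Assumption: the line is a JSON array [s_id, t_id, s_text, t_text].
    s_id, t_id, s_text, t_text = json.loads(line)
    # IDs are normalized to strings because their JSON type is not documented.
    return ReuseCase(str(s_id), str(t_id), s_text, t_text)


def iter_reuse_cases(path: str) -> Iterator[ReuseCase]:
    # Stream the gzip file line by line so that the multi-gigabyte
    # archives never have to fit into memory at once.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_reuse_line(line)
```

Under the same assumptions, iterating over iter_reuse_cases("within-wikipedia-tr-01.gz") would yield one ReuseCase per line. If the lines turn out to be JSON objects keyed by field name rather than positional arrays, the unpacking in parse_reuse_line would need to be adjusted accordingly.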

Common Crawl & Wikipedia Text Reuse: To download the corpus, use the following links:

Coming soon!

Research

The datasets were extracted by Alshomary et al. (2018), who set out to study text reuse phenomena related to Wikipedia at scale. They developed a pipeline for large-scale text reuse extraction and applied it to Wikipedia and the Common Crawl.

People

Publications