Webis-Wikipedia-Text-Reuse-18

Synopsis

The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl.

Download

You can access the "Within Wikipedia Text Reuse" corpus on Zenodo.

  • wikipedia.tar.gz (3.6 GB)

    - Each line, representing a Wikipedia article, contains a JSON array of article_id, article_title, and article_body

  • within-wikipedia-tr-01.gz (4.4 GB)

    - Each line, representing a text reuse case, contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)

  • within-wikipedia-tr-02.gz (3.7 GB)

    - Each line, representing a text reuse case, contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
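The line format above can be consumed with a few lines of Python. This is a minimal sketch, assuming each line of a gzipped file such as within-wikipedia-tr-01.gz is a JSON array in the field order listed (s_id, t_id, s_text, t_text); the function name is our own, not part of the corpus distribution.

```python
import gzip
import json

def read_reuse_cases(path):
    """Yield one dict per text reuse case from a gzipped JSON-lines file.

    Assumes each line is a JSON array [s_id, t_id, s_text, t_text],
    as described in the corpus documentation above.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            s_id, t_id, s_text, t_text = json.loads(line)
            yield {"s_id": s_id, "t_id": t_id,
                   "s_text": s_text, "t_text": t_text}
```

Because the function is a generator, even the multi-gigabyte archives can be scanned without loading them fully into memory.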

You can access the "Commoncrawl & Wikipedia Text Reuse" corpus on Zenodo.

  • preprocessed_web_sample.tar.gz (download)

    - Each line, representing a web page, contains a JSON array of page_id, page_url, and content

  • without-wikipedia-tr.zip (download)

    - Each line, representing a text reuse case, contains a JSON array of s_id (Wikipedia page id), t_id (web page id), s_text (source text), t_text (target text)
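The zipped variant can be read the same way with the standard library. This is a hedged sketch, not part of the corpus tooling: the member names inside without-wikipedia-tr.zip are assumed to be JSON-lines files, each line a JSON array [s_id, t_id, s_text, t_text] as documented above.

```python
import json
import zipfile

def iter_cases(zip_path):
    """Yield (s_id, t_id, s_text, t_text) tuples from a zip of
    JSON-lines members. Member names inside the archive are assumed,
    not specified by the corpus documentation."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            with zf.open(name) as f:
                for raw in f:
                    s_id, t_id, s_text, t_text = json.loads(raw)
                    yield s_id, t_id, s_text, t_text
```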

Research

The datasets were extracted by Alshomary et al. (2018) in work that aimed to study the text reuse phenomenon related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.

People

Publications