Synopsis

The Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11) contains 7,859 candidate paraphrases obtained through crowdsourcing on Mechanical Turk. The corpus consists of 4,067 accepted paraphrases, 3,792 rejected non-paraphrases, and the original texts. These samples formed part of the PAN 2010 international plagiarism detection competition, but were not previously available separately from the rest of the competition data.

Download

We provide the dataset as a single folder in a Zip archive. Each paraphrase is represented by three files: the original text (e.g., "1-original.txt"), the paraphrase text (e.g., "1-paraphrase.txt"), and a metadata file (e.g., "1-metadata.txt") that records the task identifier, the task author identifier, the time taken, and whether the paraphrase was accepted or rejected.
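The three-files-per-sample layout described above can be grouped by sample number with a few lines of Python. This is only a minimal sketch: the helper names `list_samples` and `read_sample` are hypothetical, the folder path is whatever location the archive was extracted to, and UTF-8 encoding is an assumption that is not stated here.

```python
from pathlib import Path

# File-name suffixes taken from the examples above ("1-original.txt", ...).
KINDS = ("original", "paraphrase", "metadata")


def list_samples(corpus_dir):
    """Group corpus files by numeric sample id.

    Returns {sample_id: {"original": Path, "paraphrase": Path, "metadata": Path}},
    ordered by sample id. Files that do not match "<number>-<kind>.txt" are ignored.
    """
    samples = {}
    for path in Path(corpus_dir).glob("*-*.txt"):
        sample_id, _, kind = path.stem.partition("-")
        if sample_id.isdigit() and kind in KINDS:
            samples.setdefault(int(sample_id), {})[kind] = path
    return dict(sorted(samples.items()))


def read_sample(files):
    """Read every file of one sample into a dict of strings.

    Assumption: the files are UTF-8 encoded; adjust if decoding fails.
    """
    return {kind: path.read_text(encoding="utf-8") for kind, path in files.items()}
```

Calling `list_samples` on the extracted folder and then `read_sample` on one of its values would give the original text, the paraphrase, and the raw metadata for that sample.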

To download the corpus, use the following link:

Research

The original samples were extracted from Project Gutenberg and range from 28 to 954 words in length. The example below shows one sample from the corpus.

Sample original text:

  • "I dipped into these pages, and as I read for the first time some of the odes of The Unknown Eros, I seemed to have made a great discovery: here was a whole glittering and peaceful tract of poetry which was like a new world to me."

Sample paraphrased text:

  • "I pored through these pages, and as I perused the lyrics of The Unknown Eros that I had never read before, I appeared to have found out something wonderful: there before me was an entire shining and calming extract of verses that were like a new universe to me."

Sample metadata:

  • HITId: 10ZS3NUGQA9S3TFNR35JUEVE5BZVJF
    WorkerId: A1YBZ0T0FK7IZO
    WorkTimeInSeconds: 345
    Paraphrase: Yes
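A metadata record like the one above can be read into a dictionary with a small parser. The sketch below assumes that each metadata file holds one "Key: value" pair per line, exactly as in the sample, and that the value "Yes" in the Paraphrase field marks an accepted paraphrase; the value used for rejected samples is not shown here, so the code only tests for "Yes". The function name `parse_metadata` is hypothetical.

```python
def parse_metadata(text):
    """Parse "Key: value" lines into a dict of strings.

    Lines without a colon are skipped. A leading bullet character, as
    displayed in the sample above, is stripped defensively.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lstrip("•").strip()] = value.strip()
    return fields


# The sample metadata shown above.
sample = """HITId: 10ZS3NUGQA9S3TFNR35JUEVE5BZVJF
WorkerId: A1YBZ0T0FK7IZO
WorkTimeInSeconds: 345
Paraphrase: Yes"""

record = parse_metadata(sample)
# Assumption: "Yes" means the paraphrase was accepted.
is_accepted = record.get("Paraphrase") == "Yes"
seconds = int(record["WorkTimeInSeconds"])
```

Judging from the field names, HITId appears to correspond to the task identifier, WorkerId to the task author identifier, and WorkTimeInSeconds to the time taken.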

For more information on the construction of the dataset, see the publication below.

People

Students: Andreas Eiselt.

Publications