Scientific Author's Writing Style Corpus 2017


The Scientific Author's Writing Style Corpus 2017 consists of 66 experiments in which three evaluators ranked four short text snippets ("targets") by their similarity in writing style to one other snippet ("source"). The snippets were selected from the introductions of scientific articles written by single authors. Additionally, the snippets were manually checked to ensure that they contain no clear hints about authorship for the evaluators.

For more information about the extraction of the corpus, please read the paper:

  • Andi Rexha, Mark Kröll, Hermann Ziak, and Roman Kern. "Extending Scientific Literature Search by Including the Author's Writing Style." In BIR@ECIR, pp. 93-100. 2017.


You can access the Scientific Author's Writing Style Corpus 2017 on Zenodo.

If you use the dataset in your research, please send us a copy of your publication. If you additionally want to link to the dataset, please use the dataset's [doi] for a stable link.


The corpus was created during a pilot study to capture how humans behave when identifying the authorship of text snippets. Each snippet consists of the opening sentences of the respective scientific article, up to and including the sentence that ends after the 400th character. The four target snippets were selected to have a similar topical distance to the respective source snippet, measured as the cosine similarity of the snippets' word vectors. The corpus therefore allows authorship tasks to be studied without topical hints.
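
The snippet cut-off and the topical distance described above can be illustrated with a minimal Python sketch. It is not the code used to build the corpus: the function names (extract_snippet, bag_of_words, cosine_similarity), the naive punctuation-based sentence splitting, the choice to treat a sentence ending exactly at the 400th character as a valid cut-off, and the use of plain word-count vectors are all assumptions made for illustration. The paper should be consulted for the word vectors actually used.

```python
import math
import re
from collections import Counter


def extract_snippet(text, min_chars=400):
    """Return the leading sentences of `text`, cut at the first sentence
    boundary located at or after `min_chars` characters (assumed rule)."""
    # Naive boundary (assumption): '.', '!' or '?' followed by whitespace
    # or by the end of the text.
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        if match.end() >= min_chars:
            return text[:match.end()]
    # No boundary past the threshold: keep the whole text.
    return text


def bag_of_words(snippet):
    """Hypothetical word vector: lower-cased word counts."""
    return Counter(re.findall(r"[a-z]+", snippet.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty snippet has no topical overlap
    return dot / (norm_a * norm_b)
```

Under these assumptions, target snippets with a comparable topical distance to a source would be those whose cosine_similarity(bag_of_words(source), bag_of_words(target)) values lie close to each other.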


  • Andi Rexha
  • Mark Kröll
  • Hermann Ziak
  • Roman Kern