Synopsis

The Webis Abstractive Snippet Corpus 2020 (Webis-Snippet-20) comprises four abstractive snippet datasets built from ClueWeb09, ClueWeb12, and DMOZ descriptions. More than 10 million <webpage, abstractive snippet> pairs and 3.5 million <query, webpage, abstractive snippet> triples were collected.

Download

You can access the Webis-Snippet-20 corpus on Zenodo.

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to cite the corpus via [bib]. If you additionally want to link to the dataset, please use the dataset's DOI for a stable link.

Research

An abstractive snippet is an originally written piece of text that summarizes a web page on a search engine results page. Compared to conventional extractive snippets, which are generated by extracting phrases and sentences verbatim from a web page, abstractive snippets circumvent copyright issues; even more interesting is the fact that they open the door for personalization. Abstractive snippets have been evaluated as equally powerful in terms of user acceptance and expressiveness, but the key question remains: Can abstractive snippets be automatically generated with sufficient quality? We introduce a new approach to abstractive snippet generation: We identify the first two large-scale sources for distant supervision, namely web directories and anchor contexts. By utilizing the DMOZ Open Directory Project and by mining the entire ClueWeb09 and ClueWeb12 for anchor contexts, we compile the Webis Abstractive Snippet Corpus 2020, comprising more than 3.5 million triples of the form <query, snippet, document> as training examples, where the snippet is either an anchor context or a web directory description in lieu of a genuine query-biased abstractive snippet of the web document.
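
To make the <query, snippet, document> triple structure concrete, the following is a minimal Python sketch of how such training examples might be represented and read. The JSON Lines layout, the field names (query, snippet, document, source), and the two source labels are assumptions made for illustration only; they are not the documented file format of the corpus, so check the files on Zenodo for the actual schema.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class SnippetExample:
    """One <query, snippet, document> training triple (hypothetical schema)."""
    query: str
    snippet: str    # anchor context or web directory description
    document: str   # text of the web page the snippet describes
    source: str     # assumed label: "anchor_context" or "directory_description"


def parse_examples(lines: Iterable[str]) -> Iterator[SnippetExample]:
    """Parse JSON Lines records into triples, skipping blank lines.

    The one-JSON-object-per-line layout is an assumption, not the
    corpus's confirmed format.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield SnippetExample(
            query=record["query"],
            snippet=record["snippet"],
            document=record["document"],
            source=record["source"],
        )


# Made-up sample record, for illustration only (not taken from the corpus).
sample_lines = [
    '{"query": "open directory", '
    '"snippet": "A human-edited directory of web sites.", '
    '"document": "Full text of the web page ...", '
    '"source": "directory_description"}',
]

examples = list(parse_examples(sample_lines))
```

Keeping the source label on each triple would allow anchor contexts and directory descriptions to be filtered or weighted separately during training, since both stand in for genuine query-biased abstractive snippets.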

People

Publications