The Webis TLDR Corpus (2017) consists of approximately 4 million content-summary pairs for abstractive summarization, extracted from the Reddit dataset for the years 2006-2016. This corpus is the first of its kind from the social media domain in English and was created to compensate for the lack of variety in the datasets used for abstractive summarization research with deep learning models.
To download the corpus, use the following link:
If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].
Abstractive summarization is the process of generating summaries that, like those written by humans, may include novel words not found in the source text. Deep learning models have shown encouraging results in generating summaries of relatively short texts. However, the most commonly used corpora come from the news domain, which represents only the formal side of written text. There is a need for corpora from more informal domains such as social media, whose poorly structured texts present interesting challenges and encourage further research into abstractive summarization. We believe that the Webis-TLDR-17 corpus shows how information sources such as discussion forums and blogs can be leveraged to create suitable datasets for summarization. The following is an example of a content-summary pair from the corpus that can be used for abstractive summarization:
Content: I finished Path of Daggers earlier this month and took a break to read I Am Pilgrim ( I HIGHLY recommend ) and I am now ready to start my journey into book 9. One thing , I ' ve forgotten a few plot threads . I can ' t search for them as possible spoilers , so what do I need to know going in ? Oh , and SPOILERS for those not up to here and please , no spoilers for me. I love this series to much for it to be ruined . I know Faile has been taken but I ' m not sure about the other main characters whereabouts and smaller character plots
Summary: where are the characters at and what are they hoping to do ? People refreshing my memory would be greatly appreciated.
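As a rough illustration of what "novel words" means for a pair like the one above, the following Python sketch measures the fraction of summary tokens that never occur in the content. The `Pair` class and the `novel_word_ratio` function are hypothetical names introduced here for illustration; they are not part of the corpus or its tooling, and the corpus's actual file format is not assumed. Whitespace tokenization is also an assumption, chosen because the example text appears to be pre-tokenized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One content-summary pair (hypothetical container, not the corpus schema)."""
    content: str
    summary: str


def novel_word_ratio(pair: Pair) -> float:
    """Return the fraction of summary tokens that do not occur in the content.

    Tokens are obtained by lowercasing and splitting on whitespace, which is
    an assumption made for this sketch.
    """
    content_vocab = set(pair.content.lower().split())
    summary_tokens = pair.summary.lower().split()
    if not summary_tokens:
        # An empty summary has no tokens, hence no novel ones.
        return 0.0
    novel = [tok for tok in summary_tokens if tok not in content_vocab]
    return len(novel) / len(summary_tokens)


# Shortened excerpt of the example pair shown above.
example = Pair(
    content=(
        "I know Faile has been taken but I ' m not sure about the other "
        "main characters whereabouts and smaller character plots"
    ),
    summary="where are the characters at and what are they hoping to do ?",
)

print(f"novel-word ratio: {novel_word_ratio(example):.2f}")
```

A ratio near 0 would indicate a summary that mostly copies words from the content, while a higher ratio points to the kind of rewording that abstractive summarization aims for.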
For more information on the construction of the dataset, see the publication below.