Synopsis

  • Task: Given a social media post, generate its abstractive summary.
  • Input: [data]
  • Submission: [submit]
  • Summaries: [demo]

Task

With the vast amount of information generated online, summarization helps users quickly decide whether a lengthy article is worth reading. The problem is even harder on social media platforms, where users write informally with little concern for style, structure, or grammar. In this challenge, participants will generate summaries specifically for social media posts.

We have created one of the largest datasets for neural summarization to date, the Webis-TLDR-17 Corpus, by mining Reddit - an online discussion forum that encourages authors to append a "TL;DR" (a short, paraphrased summary) to their posts. We invite participants to develop and test their neural summarization models on our dataset to generate such TL;DRs; as an incentive, we provide a free, extensive qualitative evaluation of your models through dedicated crowdsourcing.

Data

We provide a corpus of approximately 3 million content-summary pairs mined from Reddit. Participants are responsible for splitting it into training and validation sets as they see fit.
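For illustration, the sketch below shows one way to load such a corpus and hold out a validation split. It assumes the corpus is distributed as JSON Lines with "content" and "summary" fields and uses a hypothetical file name; adjust the field names, file name, and split ratio to the actual release.

import json
import random

def load_pairs(path):
    """Read content-summary pairs from a JSON Lines corpus file."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Field names are assumed; check the corpus documentation.
            pairs.append((record["content"], record["summary"]))
    return pairs

def train_validation_split(pairs, validation_fraction=0.05, seed=42):
    """Shuffle reproducibly and hold out a small validation set."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]

if __name__ == "__main__":
    pairs = load_pairs("corpus-webis-tldr-17.json")  # hypothetical file name
    train, validation = train_validation_split(pairs)
    print(f"{len(train)} training pairs, {len(validation)} validation pairs")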

Evaluation

To provide fast approximations of model performance, the public leaderboard will be updated with ROUGE scores. You will be able to self-evaluate your software using the TIRA service. You can find the user guide here. Additionally, a qualitative evaluation will be performed through crowdsourcing. Human annotators will rate each candidate summary according to five linguistic qualities as suggested by the DUC guidelines.
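For a rough self-check of the automatic metric, the sketch below computes ROUGE-1, ROUGE-2, and ROUGE-L for a candidate summary against a reference using the rouge-score package. This is only an approximation: the official leaderboard scores come from the evaluator in TIRA, which may use a different ROUGE implementation and configuration.

from rouge_score import rouge_scorer

# The ROUGE variants reported on the leaderboard: unigram, bigram, and LCS overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "got a new job in another city, so I am selling my apartment and moving"
candidate = "selling my apartment because I got a job in another city"

scores = scorer.score(reference, candidate)
for name, score in scores.items():
    print(f"{name}: precision={score.precision:.3f} recall={score.recall:.3f} f1={score.fmeasure:.3f}")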

Submission

Introduction

Participants will install/deploy their models in dedicated TIRA virtual machines so that their runs can be reproduced and can easily be applied to different data (of the same format) in the future.

Quick start

Once your trained models are ready, upload them to the VM along with any other code necessary. Please note that the models must be able to perform inference on CPU, as the VMs do not have GPUs. Inside the VMs, the task data is mounted at /media/training-datasets/tldr-generation; you will find both the training and validation datasets there. We recommend first running your models to generate summaries on the validation set after connecting to your VM through SSH or RDP (you can find the host ports in the web interface; the login is the same as for your VM). If you cannot connect to your VM, make sure it is powered on: you can check and power on your machine in the web interface.

Next, register the shell command that runs your system in the web interface, and run it. Note that your VM will not be accessible while your system is running: it will be "sandboxed", i.e. detached from the internet, and the VM's pre-run state will be restored after the run finishes. Your run can then be reviewed and evaluated by the organizers. You can inspect runs on the training and validation data yourself.

Note that your system is expected to read the paths to the input and output folders from the command line. When you register the command that runs your system, put variables in the positions where you expect these paths. For example, if your system expects the options -i and -o, followed by the input and output path respectively, the command you register may look like this:
/home/my-user-name/my-software/run.sh -i $inputDataset -o $outputDir
The actually executed command will then look something like this:
/home/my-user-name/my-software/run.sh \
  -i /media/training-datasets/tldr-generation/inlg-19-tldr-generation-validation-dataset-2018-11-05 \
  -o /tmp/my-user-name/2019-03-15-10-11-19/output

You can also run Python scripts directly instead of a bash script, after registering your working directory in the web interface. The $inputDataset variable can be set using the drop-down in the web interface when you click on Add Software.
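As a concrete illustration of such an entry point, the minimal Python sketch below reads the input and output paths from -i and -o, generates a summary for each post, and writes the results to the output folder. The input/output file names, the JSON Lines format, and the summarize() placeholder are assumptions; replace them with your own model and the task's actual data format.

import argparse
import json
import os

def summarize(content):
    """Placeholder for your trained model's inference (must run on CPU in the VM)."""
    return content.split("\n")[0][:100]  # trivial stand-in: truncate the first line

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", dest="input_dataset", required=True)
    parser.add_argument("-o", dest="output_dir", required=True)
    args = parser.parse_args()

    os.makedirs(args.output_dir, exist_ok=True)
    # File names and JSON Lines format are assumed for illustration.
    input_path = os.path.join(args.input_dataset, "posts.jsonl")
    output_path = os.path.join(args.output_dir, "summaries.jsonl")

    with open(input_path, encoding="utf-8") as infile, \
         open(output_path, "w", encoding="utf-8") as outfile:
        for line in infile:
            record = json.loads(line)
            summary = summarize(record["content"])
            outfile.write(json.dumps({"id": record.get("id"), "summary": summary}) + "\n")

if __name__ == "__main__":
    main()

With a script like this, the command you register might look like: python3 /home/my-user-name/my-software/run.py -i $inputDataset -o $outputDir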

Scoring

When you have successfully tested your system on the validation data, run it on the test data: inlg-19-tldr-generation-test-dataset-2018-11-05.
This is only possible through the web interface. Once the run of your system completes, please also run the evaluator on the output of your system. These are two separate actions and both should be invoked through the web interface of TIRA. You don’t have to install the evaluator in your VM. It is already prepared in TIRA. You should see it in the web interface, under your software, labeled “Evaluator”. Before clicking the “Run” button, you will use a drop-down menu to select the “Input run”, i.e. one of the completed runs of your system. The output files from the selected run will be evaluated.

You will see neither the files your system outputs, nor your STDOUT or STDERR. In the evaluator run you will see STDERR, which will tell you if one or more of your output files is not valid. If you think something went wrong with your run, send us an e-mail. We can unblind your STDOUT and STDERR on demand, after we check that you did not leak the test data in the output.

You can register more than one system ("software"/"model") per virtual machine using the web interface. TIRA assigns systems the automatic names "Software 1", "Software 2", etc. You can perform several runs per system. We will score only the latest evaluator run for each of your systems.

NOTE: By submitting your software you retain full copyright. You agree to grant us usage rights to evaluate the data generated by your models. We agree not to share your model with a third party or use it for any purpose other than research. The generated summaries will, however, be shared with a crowdsourcing platform for evaluation.

Results

Automatic metric: ROUGE (scores rounded off)

Models are grouped by participant.
Model              R-1   R-2   R-LCS   Average length
unified-pgn         19     4      15             33.5
unified-vae-pgn     19     4      15             32.8
transf-seq2seq      19     5      14             14.5
pseudo-self-attn    18     4      13             12.1
tldr-bottom-up      20     4      15             37.3

Task Committee