The Webis Clickbait Corpus 2017 (Webis-Clickbait-17) comprises a total of 38,517 Twitter posts from 27 major US news publishers. In addition to the posts, information about the articles linked in the posts is included. The posts were published between November 2016 and June 2017. To avoid publisher and topical biases, a maximum of ten posts per day and publisher were sampled. All posts were annotated on a 4-point scale [not click baiting (0.0), slightly click baiting (0.33), considerably click baiting (0.66), heavily click baiting (1.0)] by five annotators from Amazon Mechanical Turk. A total of 9,276 posts are considered clickbait by the majority of annotators. In terms of size, this corpus exceeds the Webis Clickbait Corpus 2016 by an order of magnitude. The corpus is divided into two logical parts, a training and a test dataset. The training dataset has been released in the course of the Clickbait Challenge, and a download link is provided below. To allow for an objective evaluation of clickbait detection systems, the test dataset is currently available only through the Evaluation-as-a-Service platform TIRA. On TIRA, developers can deploy clickbait detection systems and execute them against the test dataset. The performance of the submitted systems can be viewed on the TIRA page of the Clickbait Challenge.
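To illustrate how five judgments on the 4-point scale might be summarized, the following is a minimal Python sketch. It is not the corpus authors' aggregation procedure: the function name `summarize` and the rule that a judgment above 0.5 counts as a clickbait vote are assumptions made for this example only.

```python
from statistics import mean

# The four scale values and their labels, as stated in the corpus description.
SCALE = {
    0.0: "not click baiting",
    0.33: "slightly click baiting",
    0.66: "considerably click baiting",
    1.0: "heavily click baiting",
}


def summarize(judgments, threshold=0.5):
    """Summarize the judgments of one post.

    ASSUMPTION: a judgment strictly above `threshold` counts as a vote for
    "clickbait"; the corpus description does not spell out this rule.
    """
    for value in judgments:
        if value not in SCALE:
            raise ValueError(f"not a valid scale value: {value!r}")
    votes = sum(1 for value in judgments if value > threshold)
    return {
        "mean": mean(judgments),
        "clickbait_votes": votes,
        "majority_clickbait": votes > len(judgments) / 2,
    }


if __name__ == "__main__":
    # Three of five annotators chose "considerably" or "heavily" click baiting.
    print(summarize([0.0, 0.33, 0.66, 1.0, 1.0]))
```

Under a different voting rule the number of posts labeled clickbait would change, so the threshold is kept as a parameter.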
You can access the Webis-Clickbait-17 corpus on Zenodo.
- clickbait17-train-170630.zip (894 MiB, 19538 posts)
- clickbait17-train-170331.zip (157 MiB, 2459 posts)
- clickbait17-unlabeled-170720.zip (3.3 GiB, 80013 posts, without truth.jsonl)
- clickbait17-test-170720.zip (892 MiB, 18979 posts)
Each corpus zip file comprises the following resources:
instances.jsonl: A line-delimited JSON file (JSON Lines). Each line is a JSON object containing the information extracted for a specific post and its target article. Have a look at the dataset schema file for an overview of the available fields.
truth.jsonl: A line-delimited JSON file (JSON Lines). Each line is a JSON object containing the crowdsourced clickbait judgments of a specific post. Have a look at the dataset schema file for an overview of the available fields.
media/: A folder that contains all the images referenced in the instances.jsonl file.
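The two JSON Lines files can be read with a few lines of Python. The sketch below is only an illustration: the helper names `read_jsonl` and `join_on_key` are made up for this example, and the assumption that both files share an `id` field should be checked against the dataset schema file.

```python
import json


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def join_on_key(instances, truths, key="id"):
    """Pair each instance with its judgments record, or None if missing.

    ASSUMPTION: both record types carry the field named by `key`
    ("id" by default); consult the schema file for the actual field names.
    """
    truth_by_key = {truth[key]: truth for truth in truths}
    return [(inst, truth_by_key.get(inst[key])) for inst in instances]
```

After extracting a corpus zip file, something like `join_on_key(read_jsonl("instances.jsonl"), read_jsonl("truth.jsonl"))` would yield one pair per post; the paths here are placeholders. For the unlabeled archive, which ships without truth.jsonl, only `read_jsonl` applies.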
In addition to the corpus, we provide the original WARC archives of the articles that are linked in the posts:
- archives-clickbait17-train-170630.zip (94.3 GiB)
The WARC archives may be used to redo the content analysis that we performed to provide information about the linked articles in the corpus. Each article archive zip file comprises a folder for each post in the respective corpus zip file. The folders are labeled with the id of the post they refer to. To avoid a single gigantic folder, each article folder is put into a parent folder according to the last two digits of its name. Each article folder contains:
original-url.txt: A text file that contains the URL of the link that was stated in the respective post.
id.warc: The WARC archive recorded when requesting the link in the respective post.
id-live.png: A screenshot of the archived article.
url_id.html: The HTML file of the article, which is also contained in the WARC archive together with other resources. Note that this file might differ from the version in the WARC archive.
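The folder layout described above can be expressed as a small path helper. This is a sketch under assumptions: the function name `article_folder` is hypothetical, the parent folder is assumed to be named with exactly the last two characters of the post id, and the post id in the usage line is a made-up example.

```python
from pathlib import PurePosixPath


def article_folder(post_id, root="."):
    """Return the expected folder of one article inside an extracted archive.

    ASSUMPTION: the parent folder name equals the last two digits of the
    post id, e.g. <root>/33/858462320779026433.
    """
    post_id = str(post_id)
    if len(post_id) < 2:
        raise ValueError("post id must have at least two characters")
    return PurePosixPath(root) / post_id[-2:] / post_id


if __name__ == "__main__":
    # Hypothetical post id, used only to show the resulting layout.
    print(article_folder("858462320779026433", root="archives"))
    # prints: archives/33/858462320779026433
```

Appending `"original-url.txt"` to the returned path would point at the file holding the URL stated in the post.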
Clickbait refers to a certain kind of web content advertisement that is designed to entice its readers into clicking an accompanying link. Typically, it is spread on social media in the form of short teaser messages that may read like the following examples:
- A Man Falls Down And Cries For Help Twice. The Second Time, My Jaw Drops
- 9 Out Of 10 Americans Are Completely Wrong About This Mind-Blowing Fact
- Here’s What Actually Reduces Gun Violence
When reading such and similar messages, many get the distinct impression that something is odd about them; something unnamed is referred to, some emotional reaction is promised, some lack of knowledge is ascribed, some authority is claimed. Content publishers of all kinds have discovered clickbait as an effective tool to draw attention to their websites. The level of attention captured by a website determines the price of displaying ads there, where attention is measured in terms of unique page impressions, usually caused by clicking on a link that points to a given page (often abbreviated as “clicks”). Therefore, a clickbait’s target link alongside its teaser message usually leads to the sender’s website if the reader is elsewhere (for example, on a social media platform), or else to another page on the same site. The content found at the linked page often encourages the reader to share it, suggesting a clickbait teaser as the default message and thus spreading it virally. Clickbait on social media has been on the rise in recent years, and even some news publishers have adopted this technique. These developments have caused general concern among many outspoken bloggers, since clickbait threatens to clog up social media channels, and since it violates journalistic codes of ethics.
Students: Kristof Komlossy, Sebastian Schuster, Erika P. Garces Fernandez.