Webis-Web-Archive-17

Name: Webis-Web-Archive-17
Published: 2017
License: https://creativecommons.org/licenses/by/4.0/deed.en

Synopsis

The Webis-Web-Archive-17 comprises 10,000 web page archives from mid-2017 that were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. For each web page, the dataset contains the web archive files, the HTML DOM, and screenshots, as well as per-page annotations of visual web archive quality.

Access

To cite the dataset, please refer to this publication. To link to the dataset, please use the dataset permalink [doi].

Browse the dataset here.
Download the dataset from Zenodo.
Find the related metadata at Google.
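
After downloading, a quick inventory can help confirm that the archive files, DOM files, and screenshots are all present. The following is only a minimal sketch: the function name inventory_by_suffix and the idea of a locally extracted copy are assumptions for illustration, and the dataset's actual directory layout and file extensions are not described on this page, so the code makes no assumptions about them and simply counts files by suffix.

```python
from collections import Counter
from pathlib import Path


def inventory_by_suffix(root):
    """Count regular files under root, grouped by lowercase file suffix.

    `root` is a hypothetical path to a locally extracted copy of the
    dataset; no particular directory layout is assumed.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without a suffix are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Calling inventory_by_suffix on the extracted directory returns a Counter whose most_common() method lists the suffixes by frequency, which can then be compared against the expected 10,000 pages.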

People

Johannes Kiesel
Martin Potthast
Matthias Hagen
Benno Stein
Florian Kneist
