Webis-Web-Archive-17
Synopsis
The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017 that were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. For each web page, the dataset contains the web archive files, the HTML DOM, and a screenshot, as well as per-page annotations of visual web archive quality.
Access
To cite the dataset, please refer to this publication. To link to the dataset, please use the dataset permalink [doi].
People
- Johannes Kiesel
- Martin Potthast
- Matthias Hagen
- Benno Stein
- Florian Kneist
Publications