The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017. The original Webis-Web-Archive-17 dataset (described here) contains the web archive files, HTML DOM, and screenshots of each web page, as well as annotations per web page on how well the web page can be reproduced from the archive. The dataset was later extended with annotations of content errors (described here; browse the annotations here).
You can find the entire dataset on Zenodo, including archives, screenshots, and extracted HTML. You can use our webis-web-archiver tool to reproduce the web pages from the archives and programmatically access them.
If you use the dataset in your research, please send us a copy of your publication. To link to the dataset, please use its [doi], which provides a stable link.
Students: Fabienne Hubricht, Florian Kneist.
[June 26th, 2020]
Bauhaus-Universität Weimar: Researcher at Bauhaus-Universität Weimar wins "FAIRest Dataset" award