The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017, each with an annotation of how well the web page can be reproduced from the archive. This data aims to be the foundation for other web datasets. Archiving the web pages makes them reproducible, which we see as a requirement for conducting research on web page analysis tools. A key question in this regard is how well the web pages can be reproduced from the archive with current technology. To answer it, we had human annotators grade the achieved reproduction on a 5-point scale. Annotations were collected using crowdsourcing and a tailored annotation interface. Each archive contains the files requested by the browser to display a single web page. The web pages were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. The process is described in detail in an upcoming publication.
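As an illustration of working with 5-point reproduction grades like those described above, the following minimal Python sketch collapses several grades per web page into a single median grade. It is not the procedure used for the dataset: the assumption that a page receives more than one grade, the 1 to 5 integer encoding, the page identifiers, and the function name `aggregate_grades` are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from statistics import median


def aggregate_grades(judgments):
    """Collapse several grades per page into one median grade.

    `judgments` is an iterable of (page_id, grade) pairs. The 1-5 integer
    encoding of the 5-point scale is an assumption made for this sketch.
    """
    by_page = defaultdict(list)
    for page_id, grade in judgments:
        if not 1 <= grade <= 5:
            raise ValueError(f"grade out of range for {page_id}: {grade}")
        by_page[page_id].append(grade)
    # The median is robust against a single outlying annotator.
    return {page_id: median(grades) for page_id, grades in by_page.items()}


# Hypothetical judgments; identifiers and grades are made up for illustration.
example_judgments = [
    ("page-0001", 5), ("page-0001", 4), ("page-0001", 5),
    ("page-0002", 2), ("page-0002", 3), ("page-0002", 2),
]
print(aggregate_grades(example_judgments))
```

The median is used here because it tolerates one dissenting grade per page; a mean or a majority vote would be equally plausible choices under the same assumptions.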
You can find the entire dataset on Zenodo, including archives, screenshots, and extracted HTML. You can use our webis-web-archiver tool to reproduce the web pages from the archives and programmatically access them.
Students: Florian Kneist.