Webis-Web-Archive-17

Synopsis

The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017. The original dataset contains the web archive files, HTML DOM, and screenshots of each web page, as well as per-page annotations of how well the web page can be reproduced from the archive. The dataset was later extended with annotations of content errors.

Download

To download the corpus, use the following links:

People

Publications