Webis-Web-Archive-17

Synopsis

The Webis-Web-Archive-17 comprises 10,000 web page archives from mid-2017, sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. For each web page, the dataset contains the web archive files, the HTML DOM, and screenshots, as well as per-page annotations of visual web archive quality.

Access

To cite the dataset, please refer to the publications. To link to the dataset, please use the dataset permalink [doi].

  • Browse the dataset here.
  • Download the dataset from Zenodo.
  • Find the related metadata at Google.

People

Publications