Web Archive


The World Wide Web is the single largest repository of digital culture and knowledge. This project focuses on the analysis of this invaluable resource through an 8 PB dataset of both current and historical web content taken from the Internet Archive's web archive.

As part of the project, we use large-scale cluster infrastructure for both storage (Deltaweb) and processing (Betaweb, Gammaweb), as well as virtual workspaces to explore the data. The project is funded by the German Federal Ministry of Education and Research (BMBF) as part of the Immersive Web Observatory project. Partners: Prof. M. Hagen (Halle University), Jun.-Prof. M. Potthast (Leipzig University), Prof. B. Fröhlich, and Prof. B. Stein (Bauhaus-Universität Weimar). See the official announcement (in German) of the Bauhaus-Universität here. The download of the data is currently in progress.

We are interested in joint research and partnerships on this data. Please contact us for ways to get access.


The web archive is and has been part of several strands of research:
  • Text Reuse
    Analysis of how text is reused on the web.

  • Web Archive Quality
    Assessment of and improvements to the reproduction quality of web archives.


Students: Milad Alshomary, Fabienne Hubricht, Florian Kneist, Kai Lorenz, Lars Meyer