The World Wide Web is the single largest repository of digital culture and knowledge. This project focuses on the analysis of this invaluable resource through an 8 PB dataset of both current and historical web content taken from the Internet Archive's web archive.
As part of the project, we use large-scale cluster infrastructure for both storage (Deltaweb) and processing (Betaweb, Gammaweb), as well as virtual workspaces to explore the data. The project is funded by the German Federal Ministry of Education and Research (BMBF) as part of the Immersive Web Observatory project. Partners: Prof. M. Hagen (Halle University), Jun.-Prof. M. Potthast (Leipzig University), Prof. B. Fröhlich, and Prof. B. Stein (Bauhaus-Universität Weimar). See the official announcement (in German) of the Bauhaus-Universität. The download of the data is currently in progress.
We are interested in joint research and partnerships on this data. Please contact us for ways to get access.
Research
The web archive is and has been part of several strands of research:
Analysis of how text is reused on the web.
Web Archive Quality
Assessment of and improvements to the reproduction quality of web archives.
- Janek Bevendorff
- Maik Fröbe
- Matthias Hagen
- Johannes Kiesel
- Kevin Lang
- Martin Potthast
- Benno Stein
- Michael Völske
Students: Milad Alshomary, Fabienne Hubricht, Florian Kneist, Kai Lorenz, Lars Meyer