The World Wide Web is the single largest repository of digital culture and knowledge. This project analyzes this invaluable resource through an 8 PB dataset of both current and historical web content taken from the Internet Archive's web archive. Our ongoing and planned large-scale data analyses will address selected scientific, social, and ethical challenges of the information society in general, and the web in particular.
As part of the project, we use large-scale cluster infrastructure for both storage and processing (facilities.webis.de » Hardware), as well as virtual workspaces to explore the data. This project is funded by the German Federal Ministry of Education and Research (BMBF) as part of the Immersive Web Observatory project. Partners: Prof. M. Hagen (Halle University), Jun.-Prof. M. Potthast (Leipzig University), Prof. B. Fröhlich, and Prof. B. Stein (Bauhaus-Universität Weimar). See the official announcement (in German) of the Bauhaus-Universität here. The download of the data is currently in progress.
We are interested in joint research and partnerships on this data. Please contact us to discuss ways to get access.
- Janek Bevendorff
- Maik Fröbe
- Matthias Hagen
- Johannes Kiesel
- Kevin Lang
- Lars Meyer
- Martin Potthast
- Benno Stein
- Michael Völske
Students: Milad Alshomary, Fabienne Hubricht, Florian Kneist, Kai Lorenz
[June 26th, 2020]
Bauhaus-Universität Weimar: Researcher at Bauhaus-Universität Weimar wins "FAIRest Dataset" award