This corpus is an extension of the Ambient data set created by Carpineto and Romano. For each subtopic, the websites behind the given URLs were downloaded (if accessible). These documents keep their original names, for example, 1/1.4/1.3.html. Each subtopic was then manually enriched to ten documents with websites retrieved by Google (for example, 1/1.1/g00.html - 'g' for Google, 00 for the first Google result). Some subtopics could not be sufficiently enriched and were discarded. Subtopics that were duplicates or not interpretable were discarded as well.
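The naming scheme above can be read off a document path programmatically. The following is a minimal sketch, assuming paths have the form topic/subtopic/filename as in the two examples; the function name `parse_doc_path` and the returned field names are hypothetical and not part of the corpus.

```python
import re
from pathlib import PurePosixPath

# Assumption: Google-retrieved documents are named 'g' followed by digits
# (e.g. g00.html); every other file name is treated as an original document.
GOOGLE_NAME = re.compile(r"g(\d+)")


def parse_doc_path(path):
    """Split a corpus path such as '1/1.1/g00.html' into its components."""
    parts = PurePosixPath(path).parts
    if len(parts) != 3:
        raise ValueError(f"expected topic/subtopic/file, got: {path!r}")
    topic, subtopic, filename = parts
    # Drop only the final extension, so '1.3.html' becomes '1.3'.
    stem = filename.rsplit(".", 1)[0]
    match = GOOGLE_NAME.fullmatch(stem)
    return {
        "topic": topic,
        "subtopic": subtopic,
        "name": stem,
        "source": "google" if match else "original",
        # Zero-based position in the Google result list, None for originals.
        "google_rank": int(match.group(1)) if match else None,
    }


if __name__ == "__main__":
    print(parse_doc_path("1/1.1/g00.html"))
    print(parse_doc_path("1/1.4/1.3.html"))
```

Separating original from Google-retrieved documents this way may be useful if the two groups need to be evaluated separately.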
You can access the Webis-Ambient-15 corpus on Zenodo.
The dataset consists of 44 topics (topics.txt) and 481 subtopics (subtopics.txt). Some subtopics are topically very similar to each other and are therefore rather difficult to cluster. These 13 subtopics (11.2, 12.13, 14.2, 19.33, 20.2, 20.5, 21.2, 24.3, 24.4, 27.26, 31.16, 36.7, 44.9) are omitted from the file subtopics-filtered.txt, which lists only the remaining 468 subtopics.
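The filtering described above amounts to removing a fixed set of subtopic IDs. The following is a minimal sketch, assuming subtopic IDs are handled as strings of the form "topic.subtopic"; the layout of subtopics.txt is not specified here, so reading the IDs from that file is left out, and the function name `filter_subtopics` is hypothetical.

```python
# The 13 subtopic IDs listed above as too similar to other subtopics.
EXCLUDED_SUBTOPICS = {
    "11.2", "12.13", "14.2", "19.33", "20.2", "20.5", "21.2",
    "24.3", "24.4", "27.26", "31.16", "36.7", "44.9",
}


def filter_subtopics(subtopic_ids):
    """Return the given IDs in their original order, without the excluded ones."""
    # IDs are compared as strings: as numbers, '20.5' and '20.50' would collide.
    return [sid for sid in subtopic_ids if sid not in EXCLUDED_SUBTOPICS]


if __name__ == "__main__":
    sample = ["11.1", "11.2", "12.13", "44.8"]
    print(filter_subtopics(sample))
    # Consistency check against the counts stated above.
    print(481 - len(EXCLUDED_SUBTOPICS))
```

Applied to all 481 subtopic IDs, this should leave the 468 entries listed in subtopics-filtered.txt.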
If you use the dataset in your research, please send us a copy of your publication. If you additionally want to link to the dataset, please use the dataset's [doi] for a stable link.
For more information on the dataset see the publication below.
Students: Matthias Busse.