The Webis Query Spelling Corpus 2017 (Webis-QSpell-17) contains 54,772 web queries that were manually spell-checked; 9,171 of these queries are annotated with alternative spelling variants.
For segmentations of many of the queries (i.e., tagged concepts and phrases), please refer to the companion corpus Webis-QSeC-10.
You can access the Webis-QSpell-17 corpus on Zenodo.
If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to cite the corpus via [bib]. If you additionally want to link to the dataset, please use the dataset's [doi] for a stable link.
The original queries were extracted from the AOL query log and range from 3 to 10 keywords in length. Two independent annotators went through all the queries and were allowed to use any tool they wanted to support their work (e.g., Hunspell, aspell, search engines, dictionaries, Wikipedia). For each query, potential alternative spellings (possibly more than one) had to be annotated. Both annotators then discussed the cases where they disagreed, which typically resulted in several reasonable spelling variants being fed into the final corpus. After this step, three annotators each independently checked one third of the queries that contained alternative spellings from the first iteration and could add or remove variants if need be, again using tools of their choice.
The two example queries below show the corpus format; the columns are separated by semicolons:
- 4030033927;new york and company;new york & company;new york and company
- 3431465218;new york aquarium;new york aquarium;
Each query has a unique internal ID (e.g., 4030033927 in the first example); queries that are also contained in the Webis-QSeC-10 have the same IDs in both corpora. The original query spelling is in the second column, and the spelling variants added by our annotators are contained in the following column(s). In the first example, two spelling variants are given in the third and fourth columns, while in the second example only one spelling variant is given. In the second example, the spelling variant in the third column is identical to the original query in the second column, which indicates a case without a spelling error.
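As an illustration of the format described above, the following Python snippet is a minimal sketch of how a single corpus line could be parsed. The names `QueryEntry`, `parse_line`, and `original_is_variant` are hypothetical and not part of the corpus distribution. The sketch assumes that semicolons occur only as column separators and that a trailing semicolon (as in the second example) produces an empty field that can be ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryEntry:
    """One corpus line (hypothetical container, not shipped with the corpus)."""
    query_id: str        # unique internal ID, first column
    original: str        # original query spelling, second column
    variants: List[str]  # annotated spelling variants, third column onwards

    def original_is_variant(self) -> bool:
        # True if the original spelling appears among the annotated variants.
        return self.original in self.variants


def parse_line(line: str) -> QueryEntry:
    """Split one semicolon-separated line into ID, original query, and variants.

    Assumption: semicolons never occur inside a query.
    """
    fields = line.rstrip("\r\n").split(";")
    if len(fields) < 3:
        raise ValueError("expected at least three columns: " + repr(line))
    # Drop empty fields, e.g. the one created by a trailing semicolon.
    variants = [field for field in fields[2:] if field]
    return QueryEntry(query_id=fields[0], original=fields[1], variants=variants)


first = parse_line("4030033927;new york and company;new york & company;new york and company")
second = parse_line("3431465218;new york aquarium;new york aquarium;")
print(first.variants)
print(second.variants, second.original_is_variant())
```

The ID is kept as a string so that it can be matched against Webis-QSeC-10 IDs without any numeric conversion. Whether a query whose original spelling is only one of several variants should count as misspelled is left to the user; the helper merely reports membership.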
For more information on the construction of the dataset, see the respective publication.
Students: Marcel Gohsen, Anja Rathgeber