The Webis Query Segmentation Corpus 2010 (Webis-QSeC-10) contains segmentations for 53,437 web queries obtained from Mechanical Turk crowdsourcing. For each query, 10 MTurk workers were asked to segment the query. The corpus represents the distribution of their decisions.
We are currently evaluating whether a segmentation competition based on our corpus would be of interest. In the meantime, we publish a training set of 4,850 queries, which is about 10% of the remaining 48,587 queries that would serve as the test set during the competition.
We provide the training set as a ZIP archive containing a single folder with several files. The file "webis-qsec-10-training-set-queries.txt" contains the query strings and a unique ID for each query. The file "webis-qsec-10-training-set-segmentations.txt" contains the crowdsourced segmentations with their number of votes per query ID (see below for an example). The folder "data" contains all the data (n-gram frequencies, PMI values, POS tags, etc.) needed to replicate the evaluation results of our proposed segmentation algorithms. For convenience, the folder "segmentations-of-algorithms" contains the segmentations that our proposed algorithms compute on the training set.
To download the corpus, please use the following link:
(1.9 MB, MD5 sum: cd9a306261a6d884eb109098d75c7e6c)
If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].
The original queries were extracted from the AOL query log and range from 3 to 10 keywords in length. The corpus accumulates the decisions of the 10 MTurk workers who segmented each query. The examples below demonstrate two different cases.
Sample queries with internal ID (as in "webis-qsec-10-training-set-queries.txt"):
- 2315313155 harvard community credit union
- 1858084875 women's cycling tops
Sample segmentations (as in "webis-qsec-10-training-set-segmentations.txt"):
- 2315313155 [(6, 'harvard community credit union'), (2, 'harvard community|credit union'), (1, 'harvard|community|credit union'), (1, 'harvard|community credit union')]
- 1858084875 [(5, "women's|cycling tops"), (2, "women's|cycling|tops"), (2, "women's cycling|tops"), (1, "women's cycling tops")]
Each query has a unique internal ID (e.g., 2315313155 in the first example), and the segmentations file lists the decisions of the 10 MTurk workers for that query, grouped by segmentation with a vote count for each. In the first example, 6 workers kept all 4 keywords in one segment, 2 workers placed a single break after the second word (a break is denoted by |), and so on. Note that in the second example (query ID 1858084875) the query contains an apostrophe, so the segmentation strings are enclosed in double quotes instead of single quotes.
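The vote lists in the samples above look like Python list literals, so a line of the segmentations file can probably be read with the standard library alone. The following is a minimal sketch, assuming each line consists of the query ID, whitespace, and the list of (votes, segmentation) tuples exactly as in the samples (without the leading "- " used for the list on this page). The function names parse_segmentation_line and break_votes are hypothetical and not part of the corpus distribution.

```python
import ast


def parse_segmentation_line(line):
    """Split one line into the query ID and a list of (votes, segments).

    Assumption: the line has the form '<id> [(votes, "seg|seg"), ...]'.
    """
    query_id, literal = line.strip().split(maxsplit=1)
    # literal_eval handles both single- and double-quoted strings safely.
    votes = ast.literal_eval(literal)
    return int(query_id), [(count, seg.split("|")) for count, seg in votes]


def break_votes(segmentations):
    """Count how many workers placed a break after each keyword position."""
    n_words = sum(len(seg.split()) for seg in segmentations[0][1])
    breaks = [0] * (n_words - 1)
    for count, segments in segmentations:
        position = 0
        # Every segment except the last one ends with a break.
        for seg in segments[:-1]:
            position += len(seg.split())
            breaks[position - 1] += count
    return breaks


sample = ("2315313155 [(6, 'harvard community credit union'), "
          "(2, 'harvard community|credit union'), "
          "(1, 'harvard|community|credit union'), "
          "(1, 'harvard|community credit union')]")
query_id, segmentations = parse_segmentation_line(sample)
print(query_id, break_votes(segmentations))  # 2315313155 [2, 3, 0]
```

For the first sample query, 2 workers broke after "harvard", 3 after "community", and none after "credit", which gives one possible view of the distribution of decisions at each break position.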
For more information on the construction of the dataset see the respective publication.
Students: Christof Bräutigam, Anna Beyer