Webis MS MARCO Anchor Text 2022
Synopsis
The Webis MS MARCO Anchor Text 2022 dataset enriches Versions 1 and 2 of the MS MARCO document collection with anchor text extracted from six Common Crawl snapshots. The snapshots cover the years 2016 to 2021 and contain between 1.7 and 3.4 billion documents each. Overall, the dataset enriches 1,703,834 documents for Version 1 and 4,821,244 documents for Version 2 with up to 1,000 anchor texts each.
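To illustrate what "up to 1,000 anchor texts each" means in practice, the following minimal Python sketch groups anchor texts by the document they point to and caps the list per document. The input format (pairs of a target document ID and an anchor text), the function name `group_anchor_texts`, and the keep-the-first-seen rule are assumptions made for illustration; the page does not state how anchors were selected or how the released files are structured.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Cap taken from the synopsis; everything else here is a hypothetical sketch.
MAX_ANCHORS_PER_DOC = 1000


def group_anchor_texts(
    links: Iterable[Tuple[str, str]],
    max_per_doc: int = MAX_ANCHORS_PER_DOC,
) -> Dict[str, List[str]]:
    """Group (target_doc_id, anchor_text) pairs by target document.

    Keeps at most `max_per_doc` non-empty anchor texts per document,
    in the order they are seen (an assumed selection rule).
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doc_id, anchor_text in links:
        text = anchor_text.strip()
        if not text:
            continue  # skip anchors that are empty after trimming whitespace
        if len(grouped[doc_id]) < max_per_doc:
            grouped[doc_id].append(text)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical document IDs and anchor texts, for demonstration only.
    example_links = [("D1", "MS MARCO"), ("D1", " dataset "), ("D2", "")]
    print(group_anchor_texts(example_links))
```

A document that receives only empty anchors (such as `D2` in the example) does not appear in the result, which mirrors the idea that only documents with at least one usable anchor text are enriched.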
Access
To cite the dataset, please refer to this publication. To link to the dataset, please use the dataset permalink [doi].
- Use the dataset in Hugging Face Datasets.
- Download the dataset from Zenodo.
People
- Maik Fröbe
- Sebastian Günther
- Maximilian Probst
- Martin Potthast
- Matthias Hagen
Publications