Data

Lightning IR provides four datasets for different tasks: the DocDataset for indexing, the QueryDataset for retrieval, the TupleDataset for fine-tuning, and the RunDataset for re-ranking and fine-tuning. The Datasets section provides more in-depth information about how each dataset works and what data it provides. To handle batching and train/validation/test splitting, the datasets should be integrated into a LightningIRDataModule. See the Datamodule section for further details.

By tightly integrating with ir-datasets, Lightning IR provides easy access to a plethora of popular information retrieval datasets. Simply pass an ir-datasets id to a dataset class. Custom local datasets are also supported; see the how-to guide on custom datasets for details.

Datasets

Doc Dataset

A DocDataset provides access to a set of documents. This is useful for indexing with a BiEncoderModel where the embeddings of each document are stored in an index that can be used for retrieval. The snippet below demonstrates how to use a DocDataset with an ir-datasets dataset.

from lightning_ir import DocDataset

dataset = DocDataset("msmarco-passage")

print(next(iter(dataset)))
# DocSample(doc_id='0', doc='The presence of communication amid ...')
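The samples can be consumed one by one, for example to gather a doc_id-to-text mapping that an indexer could process. The snippet below is a minimal pure-Python sketch using stand-in samples shaped like the DocSample above (the second document text is made up for illustration; lightning_ir itself is not required):

```python
from collections import namedtuple

# Stand-in for lightning_ir's DocSample, using the fields shown above;
# the second document text is a hypothetical placeholder
DocSample = namedtuple("DocSample", ["doc_id", "doc"])
samples = [
    DocSample("0", "The presence of communication amid ..."),
    DocSample("1", "A second example passage ..."),
]

# Collect the corpus as a doc_id -> text mapping, as an indexer might consume it
corpus = {sample.doc_id: sample.doc for sample in samples}
print(len(corpus))  # 2
```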

Query Dataset

A QueryDataset provides access to a set of queries. This is useful for retrieval with a BiEncoderModel where the top-k documents are retrieved for each query. The snippet below demonstrates how to use a QueryDataset with an ir-datasets dataset.

from lightning_ir import QueryDataset

dataset = QueryDataset("msmarco-passage/trec-dl-2019/judged")

print(next(iter(dataset)))
# QuerySample(query_id='156493', query='do goldfish grow')

Tuple Dataset

A TupleDataset provides access to samples consisting of a query and an n-tuple of documents, where each document in a sample has a corresponding target score. Target scores are relevance assessments and could, for example, have been heuristically sampled, manually assessed, or derived from other ranking models for distillation. A TupleDataset is useful for fine-tuning a BiEncoderModel or a CrossEncoderModel. The snippet below demonstrates how to use a TupleDataset with an ir-datasets dataset.

from lightning_ir import TupleDataset

dataset = TupleDataset("msmarco-passage/train/triples-small")

print(next(iter(dataset)))
# RankSample(
#   query_id='400296',
#   query='is a little caffeine ok during pregnancy',
#   doc_ids=('1540783', '3518497'),
#   docs=(
#       'We don’t know a lot about the effects of caffeine ...',
#       'It is generally safe for pregnant women to eat chocolate ...'
#   ),
#   targets=tensor([1., 0.]),
#   qrels=None
# )
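The targets hold one relevance score per document, so they can be used, for instance, to order the documents of a sample by relevance. A minimal pure-Python sketch using the values from the RankSample above:

```python
# Values taken from the RankSample above
doc_ids = ('1540783', '3518497')
targets = [1.0, 0.0]  # one target score per document

# Sort doc_ids by descending target score, i.e., most relevant first
ranked = [doc_id for _, doc_id in sorted(zip(targets, doc_ids), reverse=True)]
print(ranked)  # ['1540783', '3518497']
```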

Run Dataset

A RunDataset provides access to a run. A run consists of samples of a query and a list of documents ranked by a relevance score. The dataset may include manual relevance assessments (qrels), which are used to evaluate the effectiveness of retrieval models. This dataset is useful for re-ranking with a CrossEncoderModel. It can also be used for fine-tuning a BiEncoderModel or CrossEncoderModel by sampling tuples from the run. The snippet below demonstrates how to use a RunDataset with an ir-datasets dataset.

from lightning_ir import RunDataset

dataset = RunDataset("msmarco-passage/trec-dl-2019/judged", depth=5)

print(next(iter(dataset)))
# RankSample(
#   query_id='1037798',
#   query='who is robert gray',
#   doc_ids=('7134595', '7134596', ...),
#   docs=(
#       'Yellow: combines with blue, lilac, light-cyan, ...',
#       'Robert Plant Net Worth is $170 Million ... ',
#       ...
#   ),
#   targets=None,
#   qrels=[
#       {'query_id': '1037798', 'doc_id': '1085628', 'iteration': 'Q0', 'relevance': 0},
#        ...
#   ]
# )
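The qrels are returned as a flat list of dictionaries. For evaluation it is often convenient to regroup them into a nested query_id → doc_id → relevance mapping, as common evaluation tools expect. A minimal pure-Python sketch using two entries shaped like the ones above:

```python
# Qrels entries as returned by the RunDataset above
qrels = [
    {'query_id': '1037798', 'doc_id': '1085628', 'iteration': 'Q0', 'relevance': 0},
    {'query_id': '1037798', 'doc_id': '1308037', 'iteration': 'Q0', 'relevance': 0},
]

# Regroup into a nested mapping: query_id -> doc_id -> relevance
qrels_dict = {}
for entry in qrels:
    qrels_dict.setdefault(entry['query_id'], {})[entry['doc_id']] = entry['relevance']
print(qrels_dict)  # {'1037798': {'1085628': 0, '1308037': 0}}
```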

Datamodule

A LightningIRDataModule conveniently handles the batching and splitting logic necessary for efficient fine-tuning and inference. Depending on the stage (see the Trainer section for details on stages), different combinations of datasets can or must be provided to a datamodule:

- For fine-tuning, a single train_dataset in the form of a TupleDataset or RunDataset must be provided; optionally, multiple inference_datasets in the form of TupleDataset or RunDataset can be provided for validation during fine-tuning.
- For indexing, one or multiple inference_datasets must be provided in the form of DocDataset.
- For searching, one or multiple inference_datasets must be provided in the form of QueryDataset.
- For re-ranking, one or multiple inference_datasets must be provided in the form of RunDataset.

The snippet below demonstrates how to use a LightningIRDataModule for fine-tuning with validation using ir-datasets datasets.

from lightning_ir import LightningIRDataModule, RunDataset, TupleDataset

train_dataset = TupleDataset("msmarco-passage/train/triples-small")
inference_dataset = RunDataset("msmarco-passage/trec-dl-2019/judged", depth=2)
datamodule = LightningIRDataModule(
    train_dataset=train_dataset,
    train_batch_size=2,
    inference_datasets=[inference_dataset],
    inference_batch_size=2,
)
datamodule.setup("fit")
train_dataloader = datamodule.train_dataloader()
print(next(iter(train_dataloader)))
# TrainBatch(
#   queries=[
#     "is a little caffeine ok during pregnancy",
#     "what fruit is native to australia",
#   ],
#   docs=[
#     (
#       "We don’t know a lot about the effects of caffeine during ...",
#       "It is generally safe for pregnant women to eat chocolate ...",
#     ),
#     (
#       "Passiflora herbertiana. A rare passion fruit native to ...",
#       "The kola nut is the fruit of the kola tree, a genus ...",
#     ),
#   ],
#   query_ids=["400296", "662731"],
#   doc_ids=[("1540783", "3518497"), ("193249", "2975302")],
#   qrels=None,
#   targets=tensor([1.0, 0.0, 1.0, 0.0]),
# )
inference_dataloader = datamodule.inference_dataloader()[0]
print(next(iter(inference_dataloader)))
# RankBatch(
#   queries=["who is robert gray", "cost of interior concrete flooring"],
#   docs=[
#     (
#       "Yellow: combines with blue, lilac, light-cyan, ...",
#       "Salad green: combines with brown, yellowish-brown, ...",
#     ),
#     (
#       "WHAT'S THE DIFFERENCE CONCRETE SLAB VS CONCRETE FLOOR? ...",
#       "If you're trying to figure out what the cost of a concrete ...",
#     ),
#   ],
#   query_ids=["1037798", "104861"],
#   doc_ids=[("7134595", "7134596"), ("841998", "842002")],
#   qrels=[
#     {"query_id": "1037798", "doc_id": "1085628", "iteration": "Q0", "relevance": 0},
#     {"query_id": "1037798", "doc_id": "1308037", "iteration": "Q0", "relevance": 0},
#     ...
#     {"query_id": "104861", "doc_id": "1017088", "iteration": "Q0", "relevance": 0},
#     {"query_id": "104861", "doc_id": "1017092", "iteration": "Q0", "relevance": 2},
#     ...
#   ],
# )
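Note how the TrainBatch above flattens the targets: with a train_batch_size of 2 and 2 documents per sample, targets appears to hold one score per document in row-major order. A pure-Python sketch of regrouping the flat scores per query, using the values from the batch above:

```python
# Values taken from the TrainBatch above
doc_ids = [("1540783", "3518497"), ("193249", "2975302")]
targets = [1.0, 0.0, 1.0, 0.0]  # flat: one score per document, row-major

# Regroup the flat targets to match the nested doc_ids structure
num_docs = len(doc_ids[0])
per_query = [targets[i * num_docs:(i + 1) * num_docs] for i in range(len(doc_ids))]
print(per_query)  # [[1.0, 0.0], [1.0, 0.0]]
```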