Data
Lightning IR provides four different datasets for different tasks: the DocDataset for indexing, the QueryDataset for retrieval, the TupleDataset for fine-tuning, and the RunDataset for re-ranking and fine-tuning. The Datasets section provides more in-depth information about how each dataset works and what data it provides. To handle batching and train/validation/test splitting, the datasets should be integrated into a LightningIRDataModule. See the Datamodule section for further details.
By tightly integrating with ir-datasets, Lightning IR provides easy access to a plethora of popular information retrieval datasets. Simply pass an ir-datasets id to a dataset class. Custom local datasets are also supported; see howto-dataset for using custom datasets.
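As a minimal sketch of the local-dataset case, a dataset class can be pointed at a file path instead of an ir-datasets id; the path below is hypothetical, and the exact format requirements are covered in howto-dataset.

from lightning_ir import RunDataset

# Hypothetical local run file in TREC run format; see howto-dataset
# for the supported file formats and naming conventions.
dataset = RunDataset("runs/my-bm25-run.run")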
Datasets
Doc Dataset
A DocDataset provides access to a set of documents. This is useful for indexing with a BiEncoderModel, where the embeddings of each document are stored in an index that can be used for retrieval. The snippet below demonstrates how to use a DocDataset with an ir-datasets dataset.
from lightning_ir import DocDataset
dataset = DocDataset("msmarco-passage")
print(next(iter(dataset)))
# DocSample(doc_id='0', doc='The presence of communication amid ...')
Query Dataset
A QueryDataset provides access to a set of queries. This is useful for retrieval with a BiEncoderModel, where the top-k documents are retrieved for each query. The snippet below demonstrates how to use a QueryDataset with an ir-datasets dataset.
from lightning_ir import QueryDataset
dataset = QueryDataset("msmarco-passage/trec-dl-2019/judged")
print(next(iter(dataset)))
# QuerySample(query_id='156493', query='do goldfish grow')
Tuple Dataset
A TupleDataset provides access to samples consisting of a query and an n-tuple of documents, with each document in a sample also having a corresponding target score. Target scores are relevance assessments and could, for example, have been heuristically sampled, manually assessed, or derived from other ranking models for distillation. A TupleDataset is useful for fine-tuning a BiEncoderModel or a CrossEncoderModel. The snippet below demonstrates how to use a TupleDataset with an ir-datasets dataset.
from lightning_ir import TupleDataset
dataset = TupleDataset("msmarco-passage/train/triples-small")
print(next(iter(dataset)))
# RankSample(
# query_id='400296',
# query='is a little caffeine ok during pregnancy',
# doc_ids=('1540783', '3518497'),
# docs=(
# 'We don’t know a lot about the effects of caffeine ...',
# 'It is generally safe for pregnant women to eat chocolate ...'
# ),
# targets=tensor([1., 0.]),
# qrels=None
# )
Run Dataset
A RunDataset provides access to a run. A run consists of samples of a query and a list of documents ranked by a relevance score. The dataset may include manual relevance assessments (qrels), which are used to evaluate the effectiveness of retrieval models. This dataset is useful for re-ranking with a CrossEncoderModel. It can also be used for fine-tuning a BiEncoderModel or a CrossEncoderModel by sampling tuples from the run (a sampling sketch follows the snippet below). The snippet below demonstrates how to use a RunDataset with an ir-datasets dataset.
from lightning_ir import RunDataset
dataset = RunDataset("msmarco-passage/trec-dl-2019/judged", depth=5)
print(next(iter(dataset)))
# RankSample(
# query_id='1037798',
# query='who is robert gray',
# doc_ids=('7134595', '7134596', ...),
# docs=(
# 'Yellow: combines with blue, lilac, light-cyan, ...',
# 'Robert Plant Net Worth is $170 Million ... ',
# ...
# ),
# targets=None,
# qrels=[
# {'query_id': '1037798', 'doc_id': '1085628', 'iteration': 'Q0', 'relevance': 0},
# ...
# ]
# )
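For fine-tuning, tuples can be sampled directly from the run. The following is a minimal sketch; the sample_size, sampling_strategy, and targets argument names are assumptions based on the RunDataset API reference and may differ in your version.

from lightning_ir import RunDataset

# Sample 2 documents per query from the top of the run and use the
# relevance judgments as fine-tuning targets (argument names are
# assumptions; consult the RunDataset API reference).
train_dataset = RunDataset(
    "msmarco-passage/trec-dl-2019/judged",
    depth=100,
    sample_size=2,
    sampling_strategy="top",
    targets="relevance",
)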
Datamodule
A LightningIRDataModule conveniently handles the batching and splitting logic necessary to ensure efficient fine-tuning and inference. Depending on the stage (see the Trainer section for details on stages), different combinations of datasets can or must be provided to a datamodule. For fine-tuning, a single train_dataset in the form of a TupleDataset or RunDataset must be provided, and optionally multiple inference_datasets in the form of TupleDataset or RunDataset can be provided for validation during fine-tuning. For indexing, one or multiple inference_datasets must be provided in the form of DocDataset. For searching, one or multiple inference_datasets must be provided in the form of QueryDataset. For re-ranking, one or multiple inference_datasets must be provided in the form of RunDataset. The snippet below demonstrates how to use a LightningIRDataModule for fine-tuning with validation using ir-datasets datasets.
from lightning_ir import LightningIRDataModule, RunDataset, TupleDataset
train_dataset = TupleDataset("msmarco-passage/train/triples-small")
inference_dataset = RunDataset("msmarco-passage/trec-dl-2019/judged", depth=2)
datamodule = LightningIRDataModule(
train_dataset=train_dataset,
train_batch_size=2,
inference_datasets=[inference_dataset],
inference_batch_size=2,
)
datamodule.setup("fit")
train_dataloader = datamodule.train_dataloader()
print(next(iter(train_dataloader)))
# TrainBatch(
# queries=[
# "is a little caffeine ok during pregnancy",
# "what fruit is native to australia",
# ],
# docs=[
# (
# "We don’t know a lot about the effects of caffeine during ...",
# "It is generally safe for pregnant women to eat chocolate ...",
# ),
# (
# "Passiflora herbertiana. A rare passion fruit native to ...",
# "The kola nut is the fruit of the kola tree, a genus ...",
# ),
# ],
# query_ids=["400296", "662731"],
# doc_ids=[("1540783", "3518497"), ("193249", "2975302")],
# qrels=None,
# targets=tensor([1.0, 0.0, 1.0, 0.0]),
# )
inference_dataloader = datamodule.inference_dataloader()[0]
print(next(iter(inference_dataloader)))
# RankBatch(
# queries=["who is robert gray", "cost of interior concrete flooring"],
# docs=[
# (
# "Yellow: combines with blue, lilac, light-cyan, ...",
# "Salad green: combines with brown, yellowish-brown, ...",
# ),
# (
# "WHAT'S THE DIFFERENCE CONCRETE SLAB VS CONCRETE FLOOR? ...",
# "If you're trying to figure out what the cost of a concrete ...",
# ),
# ],
# query_ids=["1037798", "104861"],
# doc_ids=[("7134595", "7134596"), ("841998", "842002")],
# qrels=[
# {"query_id": "1037798", "doc_id": "1085628", "iteration": "Q0", "relevance": 0},
# {"query_id": "1037798", "doc_id": "1308037", "iteration": "Q0", "relevance": 0},
# ...
# {"query_id": "104861", "doc_id": "1017088", "iteration": "Q0", "relevance": 0},
# {"query_id": "104861", "doc_id": "1017092", "iteration": "Q0", "relevance": 2},
# ...
# ],
# )
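For inference-only stages such as re-ranking, the train_dataset is simply omitted. The following is a minimal sketch; setup() is not invoked manually here because the trainer calls it for the matching stage.

from lightning_ir import LightningIRDataModule, RunDataset

# Re-ranking only requires inference datasets; no train_dataset is set.
datamodule = LightningIRDataModule(
    inference_datasets=[RunDataset("msmarco-passage/trec-dl-2019/judged", depth=100)],
    inference_batch_size=4,
)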