Quickstart Guide
Lightning IR can be used either programmatically or via its CLI. The CLI is based on the PyTorch Lightning CLI and adds options that provide a unified interface for fine-tuning and running neural ranking models.
After installing Lightning IR, the CLI is accessible via the lightning-ir command and provides commands for fine-tuning, indexing, searching, and re-ranking.
$ lightning-ir --help
...
Available subcommands:
fit Runs the full optimization routine.
index Index a collection of documents.
search Search for relevant documents.
re_rank Re-rank a set of retrieved documents.
The behavior of the CLI is most easily controlled using YAML configuration files which specify the model, data, and trainer settings.
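Because the CLI builds on the PyTorch Lightning CLI, individual configuration values can typically also be overridden directly on the command line. For example (a standard LightningCLI feature, assumed here to carry over unchanged):
lightning-ir fit --config fine-tune.yaml --trainer.max_steps 50000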
Example
The following sections provide a step-by-step example of how to fine-tune a bi-encoder or a cross-encoder model on the MS MARCO passage ranking dataset, index the documents, search for relevant documents, and re-rank these documents. If you are only interested in inference and not fine-tuning, you can skip the Fine-Tuning step and directly jump to the Indexing, Searching, or Re-Ranking steps using already fine-tuned models from the Model Zoo.
Fine-Tuning
To fine-tune a model you need to define the model module (either a BiEncoderModule or a CrossEncoderModule), the LightningIRDataModule (which uses either a TupleDataset or a RunDataset as its training dataset), and the LightningIRTrainer settings.
The following command and configuration file demonstrate how to fine-tune a bi-encoder (or cross-encoder) on the MS MARCO passage ranking dataset using the CLI.
lightning-ir fit --config fine-tune.yaml
fine-tune.yaml
trainer:
max_steps: 100_000
model:
class_path: lightning_ir.BiEncoderModule
# class_path: lightning_ir.CrossEncoderModule
init_args:
model_name_or_path: bert-base-uncased
config:
class_path: lightning_ir.BiEncoderConfig
# class_path: lightning_ir.CrossEncoderConfig
loss_functions:
- lightning_ir.RankNet
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
train_dataset:
class_path: lightning_ir.TupleDataset
init_args:
tuples_dataset: msmarco-passage/train/triples-small
train_batch_size: 32
optimizer:
class_path: torch.optim.AdamW
init_args:
lr: 1e-5
The following script demonstrates how to do the same programmatically.
fine_tune.py
from torch.optim import AdamW
from lightning_ir import (
BiEncoderConfig,
BiEncoderModule,
LightningIRDataModule,
LightningIRTrainer,
RankNet,
TupleDataset,
)
# Define the model
module = BiEncoderModule(
model_name_or_path="bert-base-uncased", # backbone model
config=BiEncoderConfig(),
loss_functions=[RankNet()], # or other loss functions
)
# or
# module = CrossEncoderModule(
#     model_name_or_path="bert-base-uncased",  # backbone model
#     config=CrossEncoderConfig(),
#     loss_functions=[RankNet()],  # or other loss functions
# )
module.set_optimizer(AdamW, lr=1e-5)
# Define the data module
data_module = LightningIRDataModule(
train_dataset=TupleDataset("msmarco-passage/train/triples-small"),
train_batch_size=32,
)
# Define the trainer
trainer = LightningIRTrainer(max_steps=100_000)
# Fine-tune the model
trainer.fit(module, data_module)
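The LightningIRTrainer extends the PyTorch Lightning Trainer, so the standard Lightning checkpointing utilities should carry over. The following sketch saves checkpoints during fine-tuning using PyTorch Lightning's ModelCheckpoint callback; it relies on that inheritance and is not a Lightning IR-specific API.
from lightning.pytorch.callbacks import ModelCheckpoint

# Standard PyTorch Lightning checkpointing; assumed to work unchanged
# because LightningIRTrainer subclasses the Lightning Trainer.
checkpoint_callback = ModelCheckpoint(dirpath="./checkpoints", save_last=True)
trainer = LightningIRTrainer(max_steps=100_000, callbacks=[checkpoint_callback])
trainer.fit(module, data_module)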
Indexing
For indexing, you need an already fine-tuned BiEncoderModel. See the Model Zoo for examples. Depending on the bi-encoder model type, you need to select the appropriate IndexConfig to pass to the IndexCallback. In addition, you need to specify the DocDataset to index. The model module, data module, and indexing callback are then passed to the trainer to run the indexing.
The following command and configuration file demonstrate how to index the MS MARCO passage ranking dataset using an already fine-tuned bi-encoder and faiss.
lightning-ir index --config index.yaml
index.yaml
trainer:
callbacks:
- class_path: lightning_ir.IndexCallback
init_args:
index_dir: ./msmarco-passage-index
index_config:
class_path: lightning_ir.FaissFlatIndexConfig
model:
class_path: lightning_ir.BiEncoderModule
init_args:
model_name_or_path: webis/bert-bi-encoder
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
inference_datasets:
- class_path: lightning_ir.DocDataset
init_args:
doc_dataset: msmarco-passage
inference_batch_size: 256
The following script demonstrates how to do the same programmatically.
index.py
from lightning_ir import (
BiEncoderModule,
DocDataset,
FaissFlatIndexConfig,
IndexCallback,
LightningIRDataModule,
LightningIRTrainer,
)
# Define the model
module = BiEncoderModule(
model_name_or_path="webis/bert-bi-encoder",
)
# Define the data module
data_module = LightningIRDataModule(
inference_datasets=[DocDataset("msmarco-passage")],
inference_batch_size=256,
)
# Define the index callback
callback = IndexCallback(
index_dir="./msmarco-passage-index",
index_config=FaissFlatIndexConfig(),
)
# Define the trainer
trainer = LightningIRTrainer(callbacks=[callback])
# Index the data
trainer.index(module, data_module)
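Before launching a full indexing run, it can help to sanity-check the model on a few passages. The snippet below is a sketch that assumes BiEncoderModule exposes a score(queries, docs) convenience method returning an output object with a scores attribute; check the API reference to confirm.
# Hypothetical sanity check; assumes a score(queries, docs) helper that
# returns an output object with a .scores attribute.
output = module.score(
    "What is the capital of France?",
    ["Paris is the capital of France.", "Berlin is the capital of Germany."],
)
print(output.scores)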
Searching
For searching, you need an already fine-tuned BiEncoderModel. See the Model Zoo for examples. Additionally, you must have created an index in the Indexing step. The search is performed by the SearchCallback, which requires a SearchConfig that corresponds to the IndexConfig used during indexing. The data module must receive a QueryDataset to iterate over a set of queries. The model module, data module, and search callback are then passed to the trainer to run searching. If the dataset provides relevance judgments and evaluation metrics are passed to the model, the trainer reports effectiveness metrics.
The following command and configuration file demonstrate how to retrieve the top-100 passages for each query from the TREC Deep Learning 2019 and 2020 tracks. After searching, the results are saved in a run file and the effectiveness is reported using nDCG@10.
lightning-ir search --config search.yaml
search.yaml
trainer:
callbacks:
- class_path: lightning_ir.SearchCallback
init_args:
index_dir: ./msmarco-passage-index
search_config:
class_path: lightning_ir.FaissSearchConfig
init_args:
k: 100
save_dir: ./runs
model:
class_path: lightning_ir.BiEncoderModule
init_args:
model_name_or_path: webis/bert-bi-encoder
evaluation_metrics:
- nDCG@10
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
inference_datasets:
- class_path: lightning_ir.QueryDataset
init_args:
query_dataset: msmarco-passage/trec-dl-2019/judged
- class_path: lightning_ir.QueryDataset
init_args:
query_dataset: msmarco-passage/trec-dl-2020/judged
inference_batch_size: 4
The following script demonstrates how to do the same programmatically.
search.py
from lightning_ir import (
BiEncoderModule,
FaissSearchConfig,
LightningIRDataModule,
LightningIRTrainer,
QueryDataset,
SearchCallback,
)
# Define the model
module = BiEncoderModule(
model_name_or_path="webis/bert-bi-encoder",
evaluation_metrics=["nDCG@10"],
)
# Define the data module
data_module = LightningIRDataModule(
inference_datasets=[
QueryDataset("msmarco-passage/trec-dl-2019/judged"),
QueryDataset("msmarco-passage/trec-dl-2020/judged"),
],
inference_batch_size=4,
)
# Define the search callback
callback = SearchCallback(
index_dir="./msmarco-passage-index",
search_config=FaissSearchConfig(k=100),
save_dir="./runs",
)
# Define the trainer
trainer = LightningIRTrainer(callbacks=[callback])
# Retrieve relevant documents
trainer.search(module, data_module)
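The files written to save_dir can be inspected with standard tooling. A minimal sketch, assuming the runs are saved in the common whitespace-separated six-column TREC run format (query_id, Q0, doc_id, rank, score, run_name):
import pandas as pd

# Load a saved run file for inspection (column layout assumed to follow
# the standard TREC run format).
run = pd.read_csv(
    "./runs/msmarco-passage-trec-dl-2019-judged.run",
    sep=r"\s+",
    names=["query_id", "q0", "doc_id", "rank", "score", "run_name"],
)
print(run.head())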
Re-Ranking
For re-ranking, you need an already fine-tuned BiEncoderModel or CrossEncoderModel (the latter are usually more effective). The data module must receive a RunDataset, which loads the run file to re-rank. To save the re-ranked run file, you need to specify a ReRankCallback. The model module, data module, and re-ranking callback are then passed to the trainer to run re-ranking. If the dataset provides relevance judgments and evaluation metrics are passed to the model, the trainer reports effectiveness metrics.
The following command and configuration file demonstrate how to re-rank the top-100 passages for each query from the TREC Deep Learning 2019 and 2020 tracks using a cross-encoder. After re-ranking, the results are saved in a run file and the effectiveness is reported using nDCG@10.
lightning-ir re_rank --config re-rank.yaml
re-rank.yaml
trainer:
callbacks:
- class_path: lightning_ir.ReRankCallback
init_args:
save_dir: ./re-ranked-runs
model:
class_path: lightning_ir.CrossEncoderModule
init_args:
model_name_or_path: webis/monoelectra-base
evaluation_metrics:
- nDCG@10
data:
class_path: lightning_ir.LightningIRDataModule
init_args:
inference_datasets:
- class_path: lightning_ir.RunDataset
init_args:
run_path_or_id: ./runs/msmarco-passage-trec-dl-2019-judged.run
- class_path: lightning_ir.RunDataset
init_args:
run_path_or_id: ./runs/msmarco-passage-trec-dl-2020-judged.run
inference_batch_size: 4
The following script demonstrates how to do the same programmatically.
re_rank.py
from lightning_ir import (
    CrossEncoderModule,
    LightningIRDataModule,
    LightningIRTrainer,
    ReRankCallback,
    RunDataset,
)
# Define the model
module = CrossEncoderModule(
model_name_or_path="webis/monoelectra-base",
evaluation_metrics=["nDCG@10"],
)
# Define the data module
data_module = LightningIRDataModule(
inference_datasets=[
RunDataset("./runs/msmarco-passage-trec-dl-2019-judged.run"),
RunDataset("./runs/msmarco-passage-trec-dl-2020-judged.run"),
],
inference_batch_size=4,
)
# Define the re-rank callback
callback = ReRankCallback(save_dir="./re-ranked-runs")
# Define the trainer
trainer = LightningIRTrainer(callbacks=[callback])
# Re-rank the documents
trainer.re_rank(module, data_module)
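Because the trainer already reports nDCG@10, offline evaluation is optional, but the re-ranked runs can also be scored with external tools such as ir_datasets and ir_measures. In the sketch below, the re-ranked file name is an assumption (taken to mirror the input run name inside save_dir):
import ir_datasets
import ir_measures
from ir_measures import nDCG

# Qrels for TREC DL 2019 and the re-ranked run (file name assumed).
qrels = ir_datasets.load("msmarco-passage/trec-dl-2019/judged").qrels_iter()
run = ir_measures.read_trec_run("./re-ranked-runs/msmarco-passage-trec-dl-2019-judged.run")
print(ir_measures.calc_aggregate([nDCG @ 10], qrels, run))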