Quickstart Guide

Lightning IR can be used either programmatically or via the CLI. The CLI is based on the PyTorch Lightning CLI and adds additional options to provide a unified interface for fine-tuning and running neural ranking models.

After installing Lightning IR, the CLI is accessible via the lightning-ir command and provides commands for fine-tuning, indexing, searching, and re-ranking.

$ lightning-ir --help

...

Available subcommands:
  fit                 Runs the full optimization routine.
  index               Index a collection of documents.
  search              Search for relevant documents.
  re_rank             Re-rank a set of retrieved documents.

The behavior of the CLI is most easily controlled using YAML configuration files which specify the model, data, and trainer settings.

Example

The following sections provide a step-by-step example of how to fine-tune a bi-encoder or a cross-encoder model on the MS MARCO passage ranking dataset, index the documents, search for relevant documents, and re-rank these documents. If you are only interested in inference and not fine-tuning, you can skip the Fine-Tuning step and directly jump to the Indexing, Searching, or Re-Ranking steps using already fine-tuned models from the Model Zoo.

Fine-Tuning

To fine-tune a model, you need to define the model module (either a BiEncoderModule or a CrossEncoderModule), the LightningIRDataModule (with either a TupleDataset or a RunDataset as the training dataset), and the LightningIRTrainer settings.

The following command and configuration file demonstrate how to fine-tune a bi-encoder (or cross-encoder) on the MS MARCO passage ranking dataset using the CLI.

lightning-ir fit --config fine-tune.yaml
fine-tune.yaml
trainer:
  max_steps: 100_000
model:
  class_path: lightning_ir.BiEncoderModule
  # class_path: lightning_ir.CrossEncoderModule
  init_args:
    model_name_or_path: bert-base-uncased
    config:
      class_path: lightning_ir.BiEncoderConfig
      # class_path: lightning_ir.CrossEncoderConfig
    loss_functions:
    - lightning_ir.RankNet
data:
  class_path: lightning_ir.LightningIRDataModule
  init_args:
    train_dataset:
      class_path: lightning_ir.TupleDataset
      init_args:
        tuples_dataset: msmarco-passage/train/triples-small
    train_batch_size: 32
optimizer:
  class_path: torch.optim.AdamW
  init_args:
    lr: 1e-5
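
Because the CLI builds on the PyTorch Lightning CLI, individual settings from a configuration file can also be overridden directly on the command line, for example:

lightning-ir fit --config fine-tune.yaml --trainer.max_steps 50000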

The following script demonstrates how to do the same programmatically.

fine_tune.py
from torch.optim import AdamW

from lightning_ir import (
    BiEncoderConfig,
    BiEncoderModule,
    LightningIRDataModule,
    LightningIRTrainer,
    RankNet,
    TupleDataset,
)

# Define the model
module = BiEncoderModule(
    model_name_or_path="bert-base-uncased",  # backbone model
    config=BiEncoderConfig(),
    loss_functions=[RankNet()],  # or other loss functions
)
# or
# module = CrossEncoderModule(  # also import CrossEncoderModule and CrossEncoderConfig
#     model_name_or_path="bert-base-uncased",  # backbone model
#     config=CrossEncoderConfig(),
#     loss_functions=[RankNet()],  # or other loss functions
# )
module.set_optimizer(AdamW, lr=1e-5)

# Define the data module
data_module = LightningIRDataModule(
    train_dataset=TupleDataset("msmarco-passage/train/triples-small"),
    train_batch_size=32,
)

# Define the trainer
trainer = LightningIRTrainer(max_steps=100_000)

# Fine-tune the model
trainer.fit(module, data_module)
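
The LightningIRTrainer extends the PyTorch Lightning Trainer, so standard checkpointing applies. To reuse the fine-tuned model by path in the later steps (the same way the Model Zoo models are passed as model_name_or_path below), a minimal sketch, assuming the module's model and tokenizer expose the standard Hugging Face save_pretrained method and using a hypothetical output directory:

# Save a Lightning checkpoint (standard PyTorch Lightning API)
trainer.save_checkpoint("fine-tuned-bi-encoder.ckpt")

# Assumption: the wrapped Hugging Face model and tokenizer can be exported so
# that this directory works as model_name_or_path in the steps below
module.model.save_pretrained("./fine-tuned-bi-encoder")
module.tokenizer.save_pretrained("./fine-tuned-bi-encoder")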

Indexing

For indexing, you need an already fine-tuned BiEncoderModel. See the Model Zoo for examples. Depending on the bi-encoder model type, you need to select the appropriate IndexConfig to pass to the IndexCallback. In addition, you need to specify the DocDataset to index. The model module, data module, and indexing callback are then passed to the trainer to run indexing.

The following command and configuration file demonstrate how to index the MS MARCO passage ranking dataset using an already fine-tuned bi-encoder and faiss.

lightning-ir index --config index.yaml
index.yaml
trainer:
  callbacks:
  - class_path: lightning_ir.IndexCallback
    init_args:
      index_dir: ./msmarco-passage-index
      index_config:
        class_path: lightning_ir.FaissFlatIndexConfig
model:
  class_path: lightning_ir.BiEncoderModule
  init_args:
    model_name_or_path: webis/bert-bi-encoder
data:
  class_path: lightning_ir.LightningIRDataModule
  init_args:
    inference_datasets:
    - class_path: lightning_ir.DocDataset
      init_args:
        doc_dataset: msmarco-passage
    inference_batch_size: 256

The following script demonstrates how to do the same programmatically.

index.py
from lightning_ir import (
    BiEncoderModule,
    DocDataset,
    FaissFlatIndexConfig,
    IndexCallback,
    LightningIRDataModule,
    LightningIRTrainer,
)

# Define the model
module = BiEncoderModule(
    model_name_or_path="webis/bert-bi-encoder",
)

# Define the data module
data_module = LightningIRDataModule(
    inference_datasets=[DocDataset("msmarco-passage")],
    inference_batch_size=256,
)

# Define the index callback
callback = IndexCallback(
    index_dir="./msmarco-passage-index",
    index_config=FaissFlatIndexConfig(),
)

# Define the trainer
trainer = LightningIRTrainer(callbacks=[callback])

# Index the data
trainer.index(module, data_module)
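
If you fine-tuned and exported your own bi-encoder in the previous step, you can index with it instead of the Model Zoo model by pointing the module at the exported directory (the hypothetical path from the fine-tuning sketch above):

module = BiEncoderModule(model_name_or_path="./fine-tuned-bi-encoder")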

Searching

For searching, you need an already fine-tuned BiEncoderModel. See the Model Zoo for examples. Additionally, you must have created an index using the Indexing step. The search is performed by the SearchCallback, which requires a SearchConfig corresponding to the IndexConfig used during indexing. The data module must receive a QueryDataset to iterate over a set of queries. The model module, data module, and search callback are then passed to the trainer to run searching. If the dataset has relevance judgements and a set of evaluation metrics is passed to the model, the trainer will report effectiveness metrics.

The following command and configuration file demonstrate how to retrieve the top-100 passages for each query from the TREC Deep Learning 2019 and 2020 tracks. After searching, the results are saved in a run file and the effectiveness is reported using nDCG@10.

lightning-ir search --config search.yaml
search.yaml
trainer:
  callbacks:
  - class_path: lightning_ir.SearchCallback
    init_args:
      index_dir: ./msmarco-passage-index
      search_config:
        class_path: lightning_ir.FaissSearchConfig
        init_args:
          k: 100
      save_dir: ./runs
model:
  class_path: lightning_ir.BiEncoderModule
  init_args:
    model_name_or_path: webis/bert-bi-encoder
    evaluation_metrics:
    - nDCG@10
data:
  class_path: lightning_ir.LightningIRDataModule
  init_args:
    inference_datasets:
    - class_path: lightning_ir.QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2019/judged
    - class_path: lightning_ir.QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2020/judged
    inference_batch_size: 4

The following script demonstrates how to do the same programmatically.

search.py
from lightning_ir import (
    BiEncoderModule,
    FaissSearchConfig,
    LightningIRDataModule,
    LightningIRTrainer,
    QueryDataset,
    SearchCallback,
)

# Define the model
module = BiEncoderModule(
    model_name_or_path="webis/bert-bi-encoder",
    evaluation_metrics=["nDCG@10"],
)

# Define the data module
data_module = LightningIRDataModule(
    inference_datasets=[
        QueryDataset("msmarco-passage/trec-dl-2019/judged"),
        QueryDataset("msmarco-passage/trec-dl-2020/judged"),
    ],
    inference_batch_size=4,
)

# Define the search callback
callback = SearchCallback(
    index_dir="./msmarco-passage-index",
    search_config=FaissSearchConfig(k=100),
    save_dir="./runs",
)

# Define the trainer
trainer = LightningIRTrainer(callbacks=[callback])

# Retrieve relevant documents
trainer.search(module, data_module)
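
To sanity-check the saved results, you can load a run file directly. A sketch, assuming the runs are saved as standard whitespace-separated TREC run files (the file name follows the pattern used in the re-ranking example below):

import pandas as pd

# Assumption: standard six-column TREC run format
run = pd.read_csv(
    "./runs/msmarco-passage-trec-dl-2019-judged.run",
    sep=r"\s+",
    names=["query_id", "q0", "doc_id", "rank", "score", "run_name"],
)
print(run.head())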

Re-Ranking

For re-ranking, you need an already fine-tuned BiEncoderModel or CrossEncoderModel (the latter are usually more effective). The data module must receive a RunDataset which loads the run file to re-rank. To save the re-ranked run file, you need to specify a ReRankCallback. The model module, data module, and re-ranking callback are then passed to the trainer to run re-ranking. If the dataset has relevance judgements and a set of evaluation metrics is passed to the model, the trainer will report effectiveness metrics.

The following command and configuration file demonstrate how to re-rank the top-100 passages for each query from the TREC Deep Learning 2019 and 2020 tracks using a cross-encoder. After re-ranking, the results are saved in a run file and the effectiveness is reported using nDCG@10.

lightning-ir re_rank --config re-rank.yaml
re-rank.yaml
trainer:
  callbacks:
  - class_path: lightning_ir.ReRankCallback
    init_args:
      save_dir: ./re-ranked-runs
model:
  class_path: lightning_ir.CrossEncoderModule
  init_args:
    model_name_or_path: webis/monoelectra-base
    evaluation_metrics:
    - nDCG@10
data:
  class_path: lightning_ir.LightningIRDataModule
  init_args:
    inference_datasets:
    - class_path: lightning_ir.RunDataset
      init_args:
        run_path_or_id: ./runs/msmarco-passage-trec-dl-2019-judged.run
    - class_path: lightning_ir.RunDataset
      init_args:
        run_path_or_id: ./runs/msmarco-passage-trec-dl-2020-judged.run
    inference_batch_size: 4

The following script demonstrates how to do the same programmatically.

re_rank.py
from lightning_ir import (
    CrossEncoderModule,
    LightningIRDataModule,
    LightningIRTrainer,
    ReRankCallback,
    RunDataset,
)

# Define the model
module = CrossEncoderModule(
    model_name_or_path="webis/monoelectra-base",
    evaluation_metrics=["nDCG@10"],
)

# Define the data module
data_module = LightningIRDataModule(
    inference_datasets=[
        RunDataset("./runs/msmarco-passage-trec-dl-2019-judged.run"),
        RunDataset("./runs/msmarco-passage-trec-dl-2020-judged.run"),
    ],
    inference_batch_size=4,
)

# Define the re-rank callback
callback = ReRankCallback(save_dir="./re-ranked-runs")

# Define the trainer
trainer = LightningIRTrainer(callbacks=[callback])

# Re-rank the documents
trainer.re_rank(module, data_module)
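
As noted above, a bi-encoder can also be used for re-ranking, although cross-encoders are usually more effective. Only the module changes; the data module, callback, and trainer stay the same:

from lightning_ir import BiEncoderModule

# Swap in a bi-encoder from the Model Zoo; everything else is unchanged
module = BiEncoderModule(
    model_name_or_path="webis/bert-bi-encoder",
    evaluation_metrics=["nDCG@10"],
)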