Model

At a high level, a LightningIRModel is a wrapper around a pre-trained Hugging Face model that specifies how the embeddings generated by that pre-trained model should be aggregated to compute a relevance score between a query and a document. The behavior of the model is dictated by a LightningIRConfig. Two types of models are supported: BiEncoderModel and CrossEncoderModel, each with their corresponding BiEncoderConfig and CrossEncoderConfig.

A LightningIRModel is backbone-agnostic and can be used with any pre-trained model from the Hugging Face Model Hub. To initialize a new LightningIRModel, select a pre-trained model from the Hugging Face Model Hub, create a LightningIRConfig, and pass both to the from_pretrained() method. Models already fine-tuned using Lightning IR can be loaded directly without specifying a config. See the Model Zoo for a list of pre-trained models.

from lightning_ir import LightningIRModel, BiEncoderConfig, CrossEncoderConfig

model = LightningIRModel.from_pretrained(
    "bert-base-uncased", config=BiEncoderConfig()
)
print(type(model))
# <class 'lightning_ir.base.class_factory.BiEncoderBertModel'>

model = LightningIRModel.from_pretrained(
    "google/electra-base-discriminator", config=CrossEncoderConfig()
)
print(type(model))
# <class 'lightning_ir.base.class_factory.CrossEncoderElectraModel'>

Bi-encoder models compute a relevance score by embedding the query and document separately and computing the similarity between the two embeddings. A cross-encoder receives both the query and the document as input and computes a relevance score based on their joint contextualized embeddings. See the Bi-Encoder and Cross-Encoder sections for more details.

The easiest way to use a LightningIRModel is through its corresponding LightningIRModule. The module combines a LightningIRModel and a LightningIRTokenizer and handles the forward pass of the model. The following example illustrates how to use a bi-encoder or a cross-encoder model to score the relevance between a query and a document. Note that the bi-encoder generates two embedding tensors (one for the query and one for the documents), while the cross-encoder generates a single joint embedding tensor.

from lightning_ir import BiEncoderModule, CrossEncoderModule

bi_encoder = BiEncoderModule("webis/bert-bi-encoder")
cross_encoder = CrossEncoderModule("webis/monoelectra-base")

query = "What is the capital of France?"
docs = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
bi_encoder_output = bi_encoder.score(query, docs)
cross_encoder_output = cross_encoder.score(query, docs)

print(bi_encoder_output.scores)
# tensor([38.9621, 29.7557])
print(bi_encoder_output.query_embeddings.embeddings.shape)
# torch.Size([1, 1, 768])
print(bi_encoder_output.doc_embeddings.embeddings.shape)
# torch.Size([2, 1, 768])
print(cross_encoder_output.scores)
# tensor([ 7.7892, -3.5815])
print(cross_encoder_output.embeddings.shape)
# torch.Size([2, 1, 768])

Bi-Encoder

The BiEncoderConfig specifies how the contextualized embeddings of a pre-trained Hugging Face model are further processed and how the relevance score is computed from the resulting embeddings. The processing pipeline consists of four steps which are executed in order and can be configured separately for query and document processing: projection, sparsification, pooling, and normalization. The flexibility of this pipeline allows for configuring a plethora of popular bi-encoder models, from learned sparse models like SPLADE to dense multi-vector models like ColBERT. The following sections describe the pipeline stages in detail and how they can be configured.

Backbone Encoding

First, an input sequence of tokens (e.g., the query or the document) is fed through a pre-trained backbone language model from Hugging Face. The model generates contextualized embeddings, one vector for each token in the input sequence, which are passed to the projection step.
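
For illustration, the per-token contextualized embeddings of a backbone model can be inspected directly with the Hugging Face transformers library (this snippet uses plain transformers and is independent of Lightning IR):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backbone = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
embeddings = backbone(**inputs).last_hidden_state
print(embeddings.shape)
# torch.Size([1, S, 768]), one 768-dimensional vector for each of the S input tokens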

Projection

The projection step adjusts the dimensionality of the embeddings. Four options are available: None, linear, linear_no_bias, and mlm. Setting projection to None leaves the contextualized embeddings as is. The linear and linear_no_bias options project the embeddings using a linear layer (with or without a bias term). The dimensionality of the resulting embeddings is configured with the embedding_dim option. For example, if embedding_dim is set to 128, the resulting embedding tensor has shape S×128, where S is the number of tokens in the input sequence. Finally, the mlm option uses the pre-trained masked language modeling head of an encoder model to project the embeddings into the dimensionality of the vocabulary. This is useful for learned sparse models such as SPLADE.
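
For example, a linear projection to 128-dimensional embeddings could be configured as follows. This is a minimal sketch; the keyword arguments mirror the option names described above (projection and embedding_dim) and should be checked against the BiEncoderConfig API reference:

from lightning_ir import LightningIRModel, BiEncoderConfig

# Project each contextualized embedding to 128 dimensions using a bias-free linear layer
config = BiEncoderConfig(projection="linear_no_bias", embedding_dim=128)
model = LightningIRModel.from_pretrained("bert-base-uncased", config=config)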

Sparsification

The sparsification step applies a function to sparsify the embedding vectors. Three options are available: None, relu, and relu_log. Setting sparsification to None leaves the contextualized embeddings as is. The relu option applies a ReLU activation function, setting all negative entries to 0; relu_log additionally applies a logarithm after the ReLU. This is useful for learned sparse models such as SPLADE.
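
A SPLADE-style learned sparse model, for instance, combines the mlm projection with relu_log sparsification. Again, this is a sketch using the option names described above as keyword arguments:

from lightning_ir import LightningIRModel, BiEncoderConfig

# Vocabulary-sized embeddings via the MLM head, sparsified with a ReLU followed by a logarithm
config = BiEncoderConfig(projection="mlm", sparsification="relu_log")
model = LightningIRModel.from_pretrained("bert-base-uncased", config=config)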

Pooling

The pooling step aggregates the embedding vectors of all tokens into a single embedding vector. Five options are available: None, first, mean, max, and sum. Setting pooling to None applies no pooling and keeps the embedding vectors of all tokens. This option should be used for multi-vector models such as ColBERT. Setting pooling to first uses the first token's contextualized embedding vector as the aggregated embedding (for models with a BERT backbone this corresponds to [CLS] pooling). The mean, max, and sum options aggregate the embedding vectors of all tokens using the respective operator.
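
As a sketch, a single-vector dense bi-encoder might use mean pooling for queries and documents, while a multi-vector ColBERT-style model disables pooling entirely. The query_pooling_strategy and doc_pooling_strategy keyword names are assumptions; consult the BiEncoderConfig API reference for the exact parameter names:

from lightning_ir import BiEncoderConfig

# Single-vector model: average all token embeddings into one vector per text
dense_config = BiEncoderConfig(query_pooling_strategy="mean", doc_pooling_strategy="mean")

# Multi-vector model: keep one embedding vector per token (ColBERT-style)
multi_vector_config = BiEncoderConfig(query_pooling_strategy=None, doc_pooling_strategy=None)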

Normalization

The normalization step determines whether the embedding vector(s) are normalized to unit length. It can be either True or False.

Scoring

After embedding the query and the document, the model computes a relevance score using a scoring function. First, the similarity (either dot product or cosine) between all query and document embedding vectors is computed. If pooling was applied, the query and the document are each represented by a single vector and their similarity is the final relevance score. If no pooling was applied, the similarity scores are aggregated in two steps. First, the scoring function computes the maximum similarity over all document embedding vectors for each query embedding vector. Then, these maximum similarities are aggregated over the query embedding vectors using the operator specified by the query_aggregation_function option. Four options are available: sum, mean, max, and harmonic_mean.
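
To make the multi-vector case concrete, the following plain PyTorch snippet mimics the scoring logic described above. It is an illustration only, not the library's internal implementation, and the tensors are randomly generated stand-ins for query and document embeddings:

import torch

# Hypothetical multi-vector embeddings: 4 query tokens and 20 document tokens, 128 dimensions each
query_embeddings = torch.randn(4, 128)
doc_embeddings = torch.randn(20, 128)

# Similarity (dot product) between every query and document embedding vector
similarity = query_embeddings @ doc_embeddings.T  # shape: (4, 20)

# Maximum similarity over all document vectors per query vector ...
max_sim = similarity.max(dim=-1).values  # shape: (4,)

# ... aggregated with the query_aggregation_function (here: sum)
score = max_sim.sum()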

Cross-Encoder

The CrossEncoderConfig specifies how the model further processes the contextualized embeddings of the backbone encoder model to compute a relevance score. A cross-encoder receives both the query and the document as input. To compute a relevance score, the model first aggregates the joint contextualized embeddings using a pooling function. Four options are available: first, mean, max, and sum. Setting pooling to first uses the first token's contextualized embedding vector as the aggregated embedding (for models with a BERT backbone this corresponds to [CLS] pooling). The mean, max, and sum options aggregate the embedding vectors of all tokens using the respective operator. The model computes the final relevance score by applying a linear layer to the pooled contextualized embedding vector.
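
For example, a cross-encoder with [CLS] pooling could be initialized as follows. This is a sketch; the pooling keyword name (here assumed to be pooling_strategy) should be checked against the CrossEncoderConfig API reference:

from lightning_ir import LightningIRModel, CrossEncoderConfig

# Pool the joint contextualized embeddings using the first ([CLS]) token;
# the model applies a linear layer to the pooled vector to produce the relevance score
config = CrossEncoderConfig(pooling_strategy="first")
model = LightningIRModel.from_pretrained("google/electra-base-discriminator", config=config)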