Model
At a high level, a LightningIRModel is a wrapper around a pre-trained Hugging Face model that specifies how the embeddings generated by that model are aggregated to compute a relevance score between a query and a document. The behavior of the model is dictated by a LightningIRConfig. Two types of models are supported: BiEncoderModel and CrossEncoderModel, each with its corresponding BiEncoderConfig and CrossEncoderConfig.
A LightningIRModel is backbone-agnostic and can be used with any pre-trained model from the Hugging Face Model Hub. To initialize a new LightningIRModel, select a pre-trained model from the Hugging Face Model Hub, create a LightningIRConfig, and pass both to the from_pretrained() method. Models already fine-tuned using Lightning IR can be loaded directly without specifying a config. See the Model Zoo for a list of pre-trained models.
from lightning_ir import BiEncoderConfig, CrossEncoderConfig, LightningIRModel

# Wrap a BERT backbone as a bi-encoder
model = LightningIRModel.from_pretrained(
    "bert-base-uncased", config=BiEncoderConfig()
)
print(type(model))
# <class 'lightning_ir.base.class_factory.BiEncoderBertModel'>

# Wrap an ELECTRA backbone as a cross-encoder
model = LightningIRModel.from_pretrained(
    "google/electra-base-discriminator", config=CrossEncoderConfig()
)
print(type(model))
# <class 'lightning_ir.base.class_factory.CrossEncoderElectraModel'>
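As noted above, models that were already fine-tuned with Lightning IR can be loaded without a config. A minimal sketch, using the webis/bert-bi-encoder checkpoint that also appears in the example below:

from lightning_ir import LightningIRModel

# The model type and config are read from the checkpoint, so no config argument is needed
model = LightningIRModel.from_pretrained("webis/bert-bi-encoder")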
Bi-encoder models compute a relevance score by embedding the query and document separately and computing the similarity between the two embeddings. A cross-encoder receives both the query and document as input and computes a relevance score based on the joint contextualized embedding. See the Bi-Encoder and Cross-Encoder sections for more details.
The easiest way to use a LightningIRModel is through its corresponding LightningIRModule. The module combines a LightningIRModel and a LightningIRTokenizer and handles the forward pass of the model. The following example illustrates how to use a bi-encoder or a cross-encoder model to score the relevance between a query and a document. Note that the bi-encoder generates two embedding tensors, while the cross-encoder generates a single joint embedding tensor.
from lightning_ir import BiEncoderModule, CrossEncoderModule

# Load fine-tuned bi-encoder and cross-encoder checkpoints
bi_encoder = BiEncoderModule("webis/bert-bi-encoder")
cross_encoder = CrossEncoderModule("webis/monoelectra-base")

query = "What is the capital of France?"
docs = ["Paris is the capital of France.", "Berlin is the capital of Germany."]

# Score the query against both documents with each model
bi_encoder_output = bi_encoder.score(query, docs)
cross_encoder_output = cross_encoder.score(query, docs)
print(bi_encoder_output.scores)
# tensor([38.9621, 29.7557])
print(bi_encoder_output.query_embeddings.embeddings.shape)
# torch.Size([1, 1, 768])
print(bi_encoder_output.doc_embeddings.embeddings.shape)
# torch.Size([2, 1, 768])
print(cross_encoder_output.scores)
# tensor([ 7.7892, -3.5815])
print(cross_encoder_output.embeddings.shape)
# torch.Size([2, 1, 768])
Bi-Encoder
The BiEncoderConfig specifies how the contextualized embeddings of a pre-trained Hugging Face model are further processed and how the relevance score is computed based on the embeddings. The processing pipeline includes four steps, which are executed in order and can be configured separately for query and document processing: projection, sparsification, pooling, and normalization. The flexibility of this pipeline allows for configuring a plethora of popular bi-encoder models, from learned sparse models like SPLADE to dense multi-vector models like ColBERT. The following sections go over the pipeline stages in detail and explain how they can be configured.
Backbone Encoding
First, an input sequence of tokens (e.g., the query or document) is fed through a pre-trained backbone language model from Hugging Face. The model generates contextualized embeddings, one vector per token in the input sequence, which are passed to the projection step.
Projection
The projection step adjusts the dimensionality of the embeddings. Four options are available: None, linear, linear_no_bias, and mlm. Setting projection to None leaves the contextualized embeddings as is. The linear and linear_no_bias options project the embeddings using a linear layer (with or without a bias term). The dimensionality of the resulting embeddings is configured with the embedding_dim option. For example, if embedding_dim is set to 128, the resulting embedding tensor has shape S x 128 for a sequence of S tokens. Finally, the mlm option uses the pre-trained masked language modeling head of an encoder model to project the embeddings into the dimensionality of the vocabulary. This is useful for learned sparse models such as SPLADE.
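A minimal sketch of how these options might be set, assuming the projection and embedding_dim parameter names of BiEncoderConfig mirror the option names used above (check the BiEncoderConfig API reference for the exact signature):

from lightning_ir import BiEncoderConfig

# Project the backbone embeddings down to 128 dimensions with a bias-free linear layer
dense_config = BiEncoderConfig(projection="linear_no_bias", embedding_dim=128)

# Use the masked language modeling head for vocabulary-sized embeddings (SPLADE-style)
sparse_config = BiEncoderConfig(projection="mlm")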
Sparsification
The sparsification step applies a function that sparsifies the embedding vectors. Three options are available: None, relu, and relu_log. Setting sparsification to None leaves the contextualized embeddings as is. The relu option applies a ReLU activation, setting all negative entries to 0; the relu_log option additionally applies a logarithm to the ReLU output. This is useful for learned sparse models such as SPLADE.
Pooling
The pooling step aggregates the per-token embedding vectors into a single embedding vector. Five options are available: None, first, mean, max, and sum. Setting pooling to None skips pooling and keeps the embedding vectors of all tokens; this option should be used for multi-vector models such as ColBERT. With first, the model uses the first token's contextualized embedding vector as the aggregated embedding (for models with a BERT backbone this corresponds to [CLS] pooling). The mean, max, and sum options aggregate over all tokens' embedding vectors using the respective operator.
Normalization
The normalization step controls whether the embedding vector(s) are normalized. It can be set to either True or False.
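Putting the four stages together, the pipeline can be configured to resemble well-known bi-encoder architectures. The following sketch is illustrative only; the parameter names are assumed to mirror the option names above and may differ from the actual BiEncoderConfig signature:

from lightning_ir import BiEncoderConfig

# ColBERT-style dense multi-vector model: linear projection, no pooling, normalized embeddings
colbert_like = BiEncoderConfig(
    projection="linear_no_bias",
    embedding_dim=128,
    sparsification=None,
    pooling=None,
    normalization=True,
)

# SPLADE-style learned sparse model: MLM projection, relu_log sparsification, max pooling
splade_like = BiEncoderConfig(
    projection="mlm",
    sparsification="relu_log",
    pooling="max",
    normalization=False,
)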
Scoring
After embedding the query and the document, the model computes a relevance score using a scoring function. First, the similarity (either dot product or cosine) between all pairs of query and document embedding vectors is computed. If pooling was applied, the query and document embeddings each consist of a single vector and their similarity is the final relevance score. If no pooling was applied, the similarity scores are aggregated: for each query embedding vector, the scoring function takes the maximum similarity over all document embedding vectors, and these per-query-vector maxima are then aggregated using the operator specified by the query_aggregation_function option. Four options are available: sum, mean, max, and harmonic_mean.
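For the multi-vector case, this max-sim aggregation can be sketched directly in PyTorch, here with dot-product similarity and sum aggregation as used by ColBERT-style models:

import torch

query_embeddings = torch.randn(6, 128)  # Q query token vectors
doc_embeddings = torch.randn(40, 128)   # D document token vectors

similarities = query_embeddings @ doc_embeddings.T      # Q x D similarity matrix
max_per_query_vector = similarities.max(dim=1).values   # maximum over document vectors
score = max_per_query_vector.sum()                      # sum aggregation over query vectors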
Cross-Encoder
The CrossEncoderConfig specifies how the model further processes the contextualized embeddings of the backbone encoder model to compute a relevance score. A cross-encoder receives both the query and the document as input. To compute a relevance score, the model first aggregates the joint contextualized embeddings using a pooling function. Four options are available: first, mean, max, and sum. With first, the model uses the first token's contextualized embedding vector as the aggregated embedding (for models with a BERT backbone this corresponds to [CLS] pooling). The mean, max, and sum options aggregate over all tokens' embedding vectors using the respective operator. The model computes the final relevance score by applying a linear layer to the pooled contextualized embedding vector.