BiEncoderConfig

class lightning_ir.bi_encoder.config.BiEncoderConfig(query_length: int = 32, doc_length: int = 512, similarity_function: 'cosine' | 'dot' = 'dot', query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, query_pooling_strategy: 'first' | 'mean' | 'max' | 'sum' | None = 'mean', query_mask_scoring_tokens: Sequence[str] | 'punctuation' | None = None, query_aggregation_function: 'sum' | 'mean' | 'max' | 'harmonic_mean' = 'sum', doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, doc_pooling_strategy: 'first' | 'mean' | 'max' | 'sum' | None = 'mean', doc_mask_scoring_tokens: Sequence[str] | 'punctuation' | None = None, normalize: bool = False, sparsification: 'relu' | 'relu_log' | None = None, add_marker_tokens: bool = False, embedding_dim: int = 768, projection: 'linear' | 'linear_no_bias' | 'mlm' | None = 'linear', **kwargs)[source]

Bases: LightningIRConfig

__init__(query_length: int = 32, doc_length: int = 512, similarity_function: 'cosine' | 'dot' = 'dot', query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, query_pooling_strategy: 'first' | 'mean' | 'max' | 'sum' | None = 'mean', query_mask_scoring_tokens: Sequence[str] | 'punctuation' | None = None, query_aggregation_function: 'sum' | 'mean' | 'max' | 'harmonic_mean' = 'sum', doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, doc_pooling_strategy: 'first' | 'mean' | 'max' | 'sum' | None = 'mean', doc_mask_scoring_tokens: Sequence[str] | 'punctuation' | None = None, normalize: bool = False, sparsification: 'relu' | 'relu_log' | None = None, add_marker_tokens: bool = False, embedding_dim: int = 768, projection: 'linear' | 'linear_no_bias' | 'mlm' | None = 'linear', **kwargs)[source]

Configuration class for a bi-encoder model.

Parameters:
  • query_length (int, optional) – Maximum query length, defaults to 32

  • doc_length (int, optional) – Maximum document length, defaults to 512

  • similarity_function (Literal['cosine', 'dot'], optional) – Similarity function to compute scores between query and document embeddings, defaults to “dot”

  • query_expansion (bool, optional) – Whether to expand queries with mask tokens, defaults to False

  • attend_to_query_expanded_tokens (bool, optional) – Whether to allow query tokens to attend to mask tokens, defaults to False

  • query_pooling_strategy (Literal['first', 'mean', 'max', 'sum'] | None, optional) – Whether and how to pool the query token embeddings, defaults to “mean”

  • query_mask_scoring_tokens (Sequence[str] | Literal['punctuation'] | None, optional) – Whether and which query tokens to ignore during scoring, defaults to None

  • query_aggregation_function (Literal['sum', 'mean', 'max', 'harmonic_mean'], optional) – How to aggregate similarity scores over query tokens, defaults to “sum”

  • doc_expansion (bool, optional) – Whether to expand documents with mask tokens, defaults to False

  • attend_to_doc_expanded_tokens (bool, optional) – Whether to allow document tokens to attend to mask tokens, defaults to False

  • doc_pooling_strategy (Literal['first', 'mean', 'max', 'sum'] | None, optional) – Whether and how to pool the document token embeddings, defaults to “mean”

  • doc_mask_scoring_tokens (Sequence[str] | Literal['punctuation'] | None, optional) – Whether and which document tokens to ignore during scoring, defaults to None

  • normalize (bool, optional) – Whether to normalize query and document embeddings, defaults to False

  • sparsification (Literal['relu', 'relu_log'] | None, optional) – Whether and which sparsification function to apply, defaults to None

  • add_marker_tokens (bool, optional) – Whether to add extra marker tokens [Q] / [D] to queries / documents, defaults to False

  • embedding_dim (int, optional) – The output embedding dimension, defaults to 768

  • projection (Literal['linear', 'linear_no_bias', 'mlm'] | None, optional) – Whether and how to project the output embeddings, defaults to “linear”
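
A minimal construction sketch (the values below are illustrative, not recommendations); the import path follows the module path shown above:

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

# Late-interaction-style setup: keep per-token embeddings (no pooling) and
# aggregate per-query-token scores with "sum"; values are illustrative only.
config = BiEncoderConfig(
    query_length=32,
    doc_length=512,
    similarity_function="dot",
    query_pooling_strategy=None,
    doc_pooling_strategy=None,
    query_aggregation_function="sum",
    add_marker_tokens=True,
    embedding_dim=128,
    projection="linear_no_bias",
)
```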

Methods

__init__([query_length, doc_length, ...])

Configuration class for a bi-encoder model.

dict_torch_dtype_to_str(d)

Checks whether the passed dictionary and its nested dicts have a torch_dtype key and if it's not None, converts torch.dtype to a string of just the type.

from_dict(config_dict, **kwargs)

Instantiates a [PretrainedConfig] from a Python dictionary of parameters.

from_json_file(json_file)

Instantiates a [PretrainedConfig] from the path to a JSON file of parameters.

from_pretrained(...)

Loads the configuration from a pretrained model.

get_config_dict(...)

Overrides the transformers.PretrainedConfig.get_config_dict method to load the tokens that should be masked during scoring.

get_text_config([decoder])

Returns the config that is meant to be used with text IO.

push_to_hub(repo_id[, use_temp_dir, ...])

Upload the configuration file to the 🤗 Model Hub.

register_for_auto_class([auto_class])

Register this class with a given auto class.

save_pretrained(save_directory, **kwargs)

Overrides the transformers.PretrainedConfig.save_pretrained method to additionally save the tokens which should be masked during scoring.

to_added_args_dict()

Outputs a dictionary of the added arguments.

to_dict()

Overrides the transformers.PretrainedConfig.to_dict method to include the added arguments, the backbone model type, and remove the mask scoring tokens.

to_diff_dict()

Removes all attributes from config which correspond to the default config attributes for better readability and serializes to a Python dictionary.

to_json_file(json_file_path[, use_diff])

Save this instance to a JSON file.

to_json_string([use_diff])

Serializes this instance to a JSON string.

to_tokenizer_dict()

Outputs a dictionary of the tokenizer arguments.

update(config_dict)

Updates attributes of this class with attributes from config_dict.

update_from_string(update_str)

Updates attributes of this class with attributes from update_str.

Attributes

ADDED_ARGS

Arguments added to the configuration.

TOKENIZER_ARGS

Arguments for the tokenizer.

attribute_map

backbone_model_type

Backbone model type for the configuration.

base_config_key

base_model_pp_plan

base_model_tp_plan

is_composition

model_type

Model type for bi-encoder models.

name_or_path

num_labels

The number of labels for classification models.

sub_configs

use_return_dict

Whether or not return [~utils.ModelOutput] instead of tuples.

ADDED_ARGS: Set[str] = {'add_marker_tokens', 'attend_to_doc_expanded_tokens', 'attend_to_query_expanded_tokens', 'doc_expansion', 'doc_length', 'doc_mask_scoring_tokens', 'doc_pooling_strategy', 'embedding_dim', 'normalize', 'projection', 'query_aggregation_function', 'query_expansion', 'query_length', 'query_mask_scoring_tokens', 'query_pooling_strategy', 'similarity_function', 'sparsification'}

Arguments added to the configuration.

TOKENIZER_ARGS: Set[str] = {'add_marker_tokens', 'attend_to_doc_expanded_tokens', 'attend_to_query_expanded_tokens', 'doc_expansion', 'doc_length', 'query_expansion', 'query_length'}

Arguments for the tokenizer.

dict_torch_dtype_to_str(d: Dict[str, Any]) None

Checks whether the passed dictionary and its nested dicts have a torch_dtype key and if it’s not None, converts torch.dtype to a string of just the type. For example, torch.float32 gets converted into the string “float32”, which can then be stored in the JSON format.

classmethod from_dict(config_dict: Dict[str, Any], **kwargs) PretrainedConfig

Instantiates a [PretrainedConfig] from a Python dictionary of parameters.

Parameters:
  • config_dict (Dict[str, Any]) – Dictionary that will be used to instantiate the configuration object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the [~PretrainedConfig.get_config_dict] method.

  • kwargs (Dict[str, Any]) – Additional parameters from which to initialize the configuration object.

Returns:

The configuration object instantiated from those parameters.

Return type:

[PretrainedConfig]
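
A hedged round-trip sketch (from_dict is inherited from transformers.PretrainedConfig):

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

# Serialize to a plain dictionary and rebuild the configuration from it.
config = BiEncoderConfig(embedding_dim=128, normalize=True)
restored = BiEncoderConfig.from_dict(config.to_dict())
assert restored.embedding_dim == 128 and restored.normalize is True
```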

classmethod from_json_file(json_file: str | PathLike) PretrainedConfig

Instantiates a [PretrainedConfig] from the path to a JSON file of parameters.

Parameters:

json_file (str or os.PathLike) – Path to the JSON file containing the parameters.

Returns:

The configuration object instantiated from that JSON file.

Return type:

[PretrainedConfig]

classmethod from_pretrained(pretrained_model_name_or_path: str | Path, *args, **kwargs) LightningIRConfig

Loads the configuration from a pretrained model. Wraps transformers.PretrainedConfig.from_pretrained.

Parameters:

pretrained_model_name_or_path (str | Path) – Pretrained model name or path

Raises:

ValueError – If pretrained_model_name_or_path is not a Lightning IR model and no LightningIRConfig is passed

Returns:

Derived LightningIRConfig class

Return type:

LightningIRConfig
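
A usage sketch; the checkpoint path below is a placeholder for any Lightning IR bi-encoder checkpoint or local directory:

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

# "path/to/bi-encoder-checkpoint" is a placeholder, not a real model id.
config = BiEncoderConfig.from_pretrained("path/to/bi-encoder-checkpoint")
print(config.similarity_function, config.embedding_dim)
```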

classmethod get_config_dict(pretrained_model_name_or_path: str | PathLike, **kwargs) Tuple[Dict[str, Any], Dict[str, Any]][source]

Overrides the transformers.PretrainedConfig.get_config_dict method to load the tokens that should be masked during scoring.

Parameters:

pretrained_model_name_or_path (str | PathLike) – Name or path of the pretrained model

Returns:

Configuration dictionary and additional keyword arguments

Return type:

Tuple[Dict[str, Any], Dict[str, Any]]
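
A sketch of retrieving the raw configuration dictionary without instantiating the config (the path is again a placeholder):

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

# Returns the configuration dictionary plus any remaining keyword arguments.
config_dict, unused_kwargs = BiEncoderConfig.get_config_dict("path/to/bi-encoder-checkpoint")
print(config_dict.get("doc_mask_scoring_tokens"))
```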

get_text_config(decoder=False) PretrainedConfig

Returns the config that is meant to be used with text IO. On most models, it is the original config instance itself. On specific composite models, it is under a set of valid names.

If decoder is set to True, then only search for decoder config names.

model_type: str = 'bi-encoder'

Model type for bi-encoder models.

property num_labels: int

The number of labels for classification models.

Type:

int

push_to_hub(repo_id: str, use_temp_dir: bool | None = None, commit_message: str | None = None, private: bool | None = None, token: bool | str | None = None, max_shard_size: int | str | None = '5GB', create_pr: bool = False, safe_serialization: bool = True, revision: str = None, commit_description: str = None, tags: List[str] | None = None, **deprecated_kwargs) str

Upload the configuration file to the 🤗 Model Hub.

Parameters:
  • repo_id (str) – The name of the repository you want to push your config to. It should contain your organization name when pushing to a given organization.

  • use_temp_dir (bool, optional) – Whether or not to use a temporary directory to store the files saved before they are pushed to the Hub. Will default to True if there is no directory named like repo_id, False otherwise.

  • commit_message (str, optional) – Message to commit while pushing. Will default to “Upload config”.

  • private (bool, optional) – Whether to make the repo private. If None (default), the repo will be public unless the organization’s default is private. This value is ignored if the repo already exists.

  • token (bool or str, optional) – The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running huggingface-cli login (stored in ~/.huggingface). Will default to True if repo_url is not specified.

  • max_shard_size (int or str, optional, defaults to “5GB”) – Only applicable for models. The maximum size for a checkpoint before being sharded. Checkpoints shard will then be each of size lower than this size. If expressed as a string, needs to be digits followed by a unit (like “5MB”). We default it to “5GB” so that users can easily load models on free-tier Google Colab instances without any CPU OOM issues.

  • create_pr (bool, optional, defaults to False) – Whether or not to create a PR with the uploaded files or directly commit.

  • safe_serialization (bool, optional, defaults to True) – Whether or not to convert the model weights in safetensors format for safer serialization.

  • revision (str, optional) – Branch to push the uploaded files to.

  • commit_description (str, optional) – The description of the commit that will be created

  • tags (List[str], optional) – List of tags to push on the Hub.

Examples:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google-bert/bert-base-cased")

# Push the config to your namespace with the name "my-finetuned-bert".
config.push_to_hub("my-finetuned-bert")

# Push the config to an organization with the name "my-finetuned-bert".
config.push_to_hub("huggingface/my-finetuned-bert")
```

classmethod register_for_auto_class(auto_class='AutoConfig')

Register this class with a given auto class. This should only be used for custom configurations as the ones in the library are already mapped with AutoConfig.

Warning: This API is experimental and may have some slight breaking changes in the next releases.

Parameters:

auto_class (str or type, optional, defaults to “AutoConfig”) – The auto class to register this new configuration with.

save_pretrained(save_directory: str | PathLike, **kwargs) None[source]

Overrides the transformers.PretrainedConfig.save_pretrained method to additionally save the tokens which should be masked during scoring.

Parameters:

save_directory (str | PathLike) – Directory to save the configuration
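
A save-and-reload sketch (the directory name is illustrative); saving also writes the mask-scoring tokens mentioned above:

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

# Persist the configuration, then load it back from the local directory.
config = BiEncoderConfig(query_mask_scoring_tokens="punctuation")
config.save_pretrained("bi-encoder-config")  # illustrative directory name
reloaded = BiEncoderConfig.from_pretrained("bi-encoder-config")
```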

to_added_args_dict() Dict[str, Any]

Outputs a dictionary of the added arguments.

Returns:

Added arguments

Return type:

Dict[str, Any]

to_dict() Dict[str, Any][source]

Overrides the transformers.PretrainedConfig.to_dict method to include the added arguments, the backbone model type, and remove the mask scoring tokens.

Returns:

Configuration dictionary

Return type:

Dict[str, Any]

to_diff_dict() Dict[str, Any]

Removes all attributes from config which correspond to the default config attributes for better readability and serializes to a Python dictionary.

Returns:

Dictionary of all the attributes that make up this configuration instance.

Return type:

Dict[str, Any]

to_json_file(json_file_path: str | PathLike, use_diff: bool = True)

Save this instance to a JSON file.

Parameters:
  • json_file_path (str or os.PathLike) – Path to the JSON file in which this configuration instance’s parameters will be saved.

  • use_diff (bool, optional, defaults to True) – If set to True, only the difference between the config instance and the default PretrainedConfig() is serialized to JSON file.

to_json_string(use_diff: bool = True) str

Serializes this instance to a JSON string.

Parameters:

use_diff (bool, optional, defaults to True) – If set to True, only the difference between the config instance and the default PretrainedConfig() is serialized to JSON string.

Returns:

String containing all the attributes that make up this configuration instance in JSON format.

Return type:

str

to_tokenizer_dict() Dict[str, Any]

Outputs a dictionary of the tokenizer arguments.

Returns:

Tokenizer arguments

Return type:

Dict[str, Any]
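
A sketch that inspects which arguments are forwarded to the tokenizer (see TOKENIZER_ARGS above); the exact contents of the returned dictionary may vary:

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

config = BiEncoderConfig(query_length=32, doc_length=256, add_marker_tokens=True)
# Only tokenizer-relevant arguments (query_length, doc_length, expansion flags,
# add_marker_tokens, ...) are expected in the returned dictionary.
print(config.to_tokenizer_dict())
```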

update(config_dict: Dict[str, Any])

Updates attributes of this class with attributes from config_dict.

Parameters:

config_dict (Dict[str, Any]) – Dictionary of attributes that should be updated for this class.

update_from_string(update_str: str)

Updates attributes of this class with attributes from update_str.

The expected format is ints, floats and strings as is, and for booleans use true or false. For example: “n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index”

The keys to change have to already exist in the config object.

Parameters:

update_str (str) – String with attributes that should be updated for this class.
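
A short sketch of the key=value string format; both keys already exist on the config, as required:

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

config = BiEncoderConfig()
# Integers are parsed as-is, booleans as "true"/"false".
config.update_from_string("query_length=64,normalize=true")
assert config.query_length == 64 and config.normalize is True
```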

property use_return_dict: bool

Whether or not return [~utils.ModelOutput] instead of tuples.

Type:

bool