BiEncoderConfig

class lightning_ir.bi_encoder.config.BiEncoderConfig(query_length: int = 32, doc_length: int = 512, similarity_function: 'cosine' | 'dot' = 'dot', query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, query_pooling_strategy: 'first' | 'mean' | 'max' | 'sum' | None = 'mean', query_mask_scoring_tokens: Sequence[str] | 'punctuation' | None = None, query_aggregation_function: 'sum' | 'mean' | 'max' | 'harmonic_mean' = 'sum', doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, doc_pooling_strategy: 'first' | 'mean' | 'max' | 'sum' | None = 'mean', doc_mask_scoring_tokens: Sequence[str] | 'punctuation' | None = None, normalize: bool = False, sparsification: 'relu' | 'relu_log' | None = None, add_marker_tokens: bool = False, embedding_dim: int = 768, projection: 'linear' | 'linear_no_bias' | 'mlm' | None = 'linear', **kwargs)[source]

Bases: LightningIRConfig

__init__(query_length: int = 32, doc_length: int = 512, similarity_function: 'cosine' | 'dot' = 'dot', query_expansion: bool = False, attend_to_query_expanded_tokens: bool = False, query_pooling_strategy: 'first' | 'mean' | 'max' | 'sum' | None = 'mean', query_mask_scoring_tokens: Sequence[str] | 'punctuation' | None = None, query_aggregation_function: 'sum' | 'mean' | 'max' | 'harmonic_mean' = 'sum', doc_expansion: bool = False, attend_to_doc_expanded_tokens: bool = False, doc_pooling_strategy: 'first' | 'mean' | 'max' | 'sum' | None = 'mean', doc_mask_scoring_tokens: Sequence[str] | 'punctuation' | None = None, normalize: bool = False, sparsification: 'relu' | 'relu_log' | None = None, add_marker_tokens: bool = False, embedding_dim: int = 768, projection: 'linear' | 'linear_no_bias' | 'mlm' | None = 'linear', **kwargs)[source]

Configuration class for a bi-encoder model.

Parameters:
  • query_length (int, optional) – Maximum query length, defaults to 32

  • doc_length (int, optional) – Maximum document length, defaults to 512

  • similarity_function (Literal['cosine', 'dot'], optional) – Similarity function to compute scores between query and document embeddings, defaults to “dot”

  • query_expansion (bool, optional) – Whether to expand queries with mask tokens, defaults to False

  • attend_to_query_expanded_tokens (bool, optional) – Whether to allow query tokens to attend to mask tokens, defaults to False

  • query_pooling_strategy (Literal['first', 'mean', 'max', 'sum'] | None, optional) – Whether and how to pool the query token embeddings, defaults to “mean”

  • query_mask_scoring_tokens (Sequence[str] | Literal['punctuation'] | None, optional) – Whether and which query tokens to ignore during scoring, defaults to None

  • query_aggregation_function (Literal['sum', 'mean', 'max', 'harmonic_mean'], optional) – How to aggregate similarity scores over query tokens, defaults to “sum”

  • doc_expansion (bool, optional) – Whether to expand documents with mask tokens, defaults to False

  • attend_to_doc_expanded_tokens (bool, optional) – Whether to allow document tokens to attend to mask tokens, defaults to False

  • doc_pooling_strategy (Literal['first', 'mean', 'max', 'sum'] | None, optional) – Whether and how to pool the document token embeddings, defaults to “mean”

  • doc_mask_scoring_tokens (Sequence[str] | Literal['punctuation'] | None, optional) – Whether and which document tokens to ignore during scoring, defaults to None

  • normalize (bool, optional) – Whether to normalize query and document embeddings, defaults to False

  • sparsification (Literal['relu', 'relu_log'] | None, optional) – Whether and which sparsification function to apply, defaults to None

  • add_marker_tokens (bool, optional) – Whether to add extra marker tokens [Q] / [D] to queries / documents, defaults to False

  • embedding_dim (int, optional) – The output embedding dimension, defaults to 768

  • projection (Literal['linear', 'linear_no_bias', 'mlm'] | None, optional) – Whether and how to project the output embeddings, defaults to “linear”
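
A minimal construction sketch (the values below are illustrative, not recommendations); the import path follows the module path shown above:

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

# Late-interaction-style setup: keep per-token embeddings (no pooling) and
# aggregate per-query-token scores with "sum"; values are illustrative only.
config = BiEncoderConfig(
    query_length=32,
    doc_length=512,
    similarity_function="dot",
    query_pooling_strategy=None,
    doc_pooling_strategy=None,
    query_aggregation_function="sum",
    add_marker_tokens=True,
    embedding_dim=128,
    projection="linear_no_bias",
)
```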

Methods

__init__([query_length, doc_length, ...])

Configuration class for a bi-encoder model.

dict_torch_dtype_to_str(d)

Checks whether the passed dictionary and its nested dicts have a torch_dtype key and if it's not None, converts torch.dtype to a string of just the type.

from_dict(config_dict, **kwargs)

Instantiates a [PretrainedConfig] from a Python dictionary of parameters.

from_json_file(json_file)

Instantiates a [PretrainedConfig] from the path to a JSON file of parameters.

from_pretrained(...)

Loads the configuration from a pretrained model.

get_config_dict(...)

Overrides the transformers.PretrainedConfig.get_config_dict method to load the tokens that should be masked during scoring.

get_text_config([decoder])

Returns the config that is meant to be used with text IO.

push_to_hub(repo_id[, use_temp_dir, ...])

Upload the configuration file to the 🤗 Model Hub.

register_for_auto_class([auto_class])

Register this class with a given auto class.

save_pretrained(save_directory, **kwargs)

Overrides the transformers.PretrainedConfig.save_pretrained method to additionally save the tokens which should be masked during scoring.

to_added_args_dict()

Outputs a dictionary of the added arguments.

to_dict()

Overrides the transformers.PretrainedConfig.to_dict method to include the added arguments, the backbone model type, and remove the mask scoring tokens.

to_diff_dict()

Removes all attributes from config which correspond to the default config attributes for better readability and serializes to a Python dictionary.

to_json_file(json_file_path[, use_diff])

Save this instance to a JSON file.

to_json_string([use_diff])

Serializes this instance to a JSON string.

to_tokenizer_dict()

Outputs a dictionary of the tokenizer arguments.

update(config_dict)

Updates attributes of this class with attributes from config_dict.

update_from_string(update_str)

Updates attributes of this class with attributes from update_str.

Attributes

ADDED_ARGS

Arguments added to the configuration.

TOKENIZER_ARGS

Arguments for the tokenizer.

attribute_map

backbone_model_type

Backbone model type for the configuration.

base_config_key

base_model_pp_plan

base_model_tp_plan

is_composition

model_type

Model type for bi-encoder models.

name_or_path

num_labels

The number of labels for classification models.

sub_configs

use_return_dict

Whether or not return [~utils.ModelOutput] instead of tuples.

ADDED_ARGS: Set[str] = {'add_marker_tokens', 'attend_to_doc_expanded_tokens', 'attend_to_query_expanded_tokens', 'doc_expansion', 'doc_length', 'doc_mask_scoring_tokens', 'doc_pooling_strategy', 'embedding_dim', 'normalize', 'projection', 'query_aggregation_function', 'query_expansion', 'query_length', 'query_mask_scoring_tokens', 'query_pooling_strategy', 'similarity_function', 'sparsification'}

Arguments added to the configuration.

TOKENIZER_ARGS: Set[str] = {'add_marker_tokens', 'attend_to_doc_expanded_tokens', 'attend_to_query_expanded_tokens', 'doc_expansion', 'doc_length', 'query_expansion', 'query_length'}

Arguments for the tokenizer.

dict_torch_dtype_to_str(d: Dict[str, Any]) None

Checks whether the passed dictionary and its nested dicts have a torch_dtype key and if it’s not None, converts torch.dtype to a string of just the type. For example, torch.float32 gets converted into the string “float32”, which can then be stored in the JSON format.

classmethod from_dict(config_dict: Dict[str, Any], **kwargs) PretrainedConfig

Instantiates a [PretrainedConfig] from a Python dictionary of parameters.

Parameters:
  • config_dict (Dict[str, Any]) – Dictionary that will be used to instantiate the configuration object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the [~PretrainedConfig.get_config_dict] method.

  • kwargs (Dict[str, Any]) – Additional parameters from which to initialize the configuration object.

Returns:

The configuration object instantiated from those parameters.

Return type:

[PretrainedConfig]
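
A hedged round-trip sketch (from_dict is inherited from transformers.PretrainedConfig):

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

# Serialize to a plain dictionary and rebuild the configuration from it.
config = BiEncoderConfig(embedding_dim=128, normalize=True)
restored = BiEncoderConfig.from_dict(config.to_dict())
assert restored.embedding_dim == 128 and restored.normalize is True
```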

classmethod from_json_file(json_file: str | PathLike) PretrainedConfig

Instantiates a [PretrainedConfig] from the path to a JSON file of parameters.

Parameters:

json_file (str or os.PathLike) – Path to the JSON file containing the parameters.

Returns:

The configuration object instantiated from that JSON file.

Return type:

[PretrainedConfig]

classmethod from_pretrained(pretrained_model_name_or_path: str | Path, *args, **kwargs) LightningIRConfig

Loads the configuration from a pretrained model. Wraps transformers.PretrainedConfig.from_pretrained.

Parameters:

pretrained_model_name_or_path (str | Path) – Pretrained model name or path

Raises:

ValueError – If pretrained_model_name_or_path is not a Lightning IR model and no LightningIRConfig is passed

Returns:

Derived LightningIRConfig class

Return type:

LightningIRConfig
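
A usage sketch; the checkpoint path below is a placeholder for any Lightning IR bi-encoder checkpoint or local directory:

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

# "path/to/bi-encoder-checkpoint" is a placeholder, not a real model id.
config = BiEncoderConfig.from_pretrained("path/to/bi-encoder-checkpoint")
print(config.similarity_function, config.embedding_dim)
```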

classmethod get_config_dict(pretrained_model_name_or_path: str | PathLike, **kwargs) Tuple[Dict[str, Any], Dict[str, Any]][source]

Overrides the transformers.PretrainedConfig.get_config_dict method to load the tokens that should be masked during scoring.

Parameters:

pretrained_model_name_or_path (str | PathLike) – Name or path of the pretrained model

Returns:

Configuration dictionary and additional keyword arguments

Return type:

Tuple[Dict[str, Any], Dict[str, Any]]
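
A sketch of retrieving the raw configuration dictionary without instantiating the config (the path is again a placeholder):

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

# Returns the configuration dictionary plus any remaining keyword arguments.
config_dict, unused_kwargs = BiEncoderConfig.get_config_dict("path/to/bi-encoder-checkpoint")
print(config_dict.get("doc_mask_scoring_tokens"))
```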

get_text_config(decoder=False) PretrainedConfig

Returns the config that is meant to be used with text IO. On most models, it is the original config instance itself. On specific composite models, it is under a set of valid names.

If decoder is set to True, then only search for decoder config names.

model_type: str = 'bi-encoder'

Model type for bi-encoder models.

property num_labels: int

The number of labels for classification models.

Type:

int

push_to_hub(repo_id: str, use_temp_dir: bool | None = None, commit_message: str | None = None, private: bool | None = None, token: bool | str | None = None, max_shard_size: int | str | None = '5GB', create_pr: bool = False, safe_serialization: bool = True, revision: str = None, commit_description: str = None, tags: List[str] | None = None, **deprecated_kwargs) str

Upload the configuration file to the 🤗 Model Hub.

Parameters:
  • repo_id (str) – The name of the repository you want to push your config to. It should contain your organization name when pushing to a given organization.

  • use_temp_dir (bool, optional) – Whether or not to use a temporary directory to store the files saved before they are pushed to the Hub. Will default to True if there is no directory named like repo_id, False otherwise.

  • commit_message (str, optional) – Message to commit while pushing. Will default to “Upload config”.

  • private (bool, optional) – Whether to make the repo private. If None (default), the repo will be public unless the organization’s default is private. This value is ignored if the repo already exists.

  • token (bool or str, optional) – The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running huggingface-cli login (stored in ~/.huggingface). Will default to True if repo_url is not specified.

  • max_shard_size (int or str, optional, defaults to “5GB”) – Only applicable for models. The maximum size for a checkpoint before being sharded. Checkpoints shard will then be each of size lower than this size. If expressed as a string, needs to be digits followed by a unit (like “5MB”). We default it to “5GB” so that users can easily load models on free-tier Google Colab instances without any CPU OOM issues.

  • create_pr (bool, optional, defaults to False) – Whether or not to create a PR with the uploaded files or directly commit.

  • safe_serialization (bool, optional, defaults to True) – Whether or not to convert the model weights in safetensors format for safer serialization.

  • revision (str, optional) – Branch to push the uploaded files to.

  • commit_description (str, optional) – The description of the commit that will be created

  • tags (List[str], optional) – List of tags to push on the Hub.

Examples:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google-bert/bert-base-cased")

# Push the config to your namespace with the name "my-finetuned-bert".
config.push_to_hub("my-finetuned-bert")

# Push the config to an organization with the name "my-finetuned-bert".
config.push_to_hub("huggingface/my-finetuned-bert")
```

classmethod register_for_auto_class(auto_class='AutoConfig')

Register this class with a given auto class. This should only be used for custom configurations as the ones in the library are already mapped with AutoConfig.

Warning: This API is experimental and may have some slight breaking changes in the next releases.

Parameters:

auto_class (str or type, optional, defaults to “AutoConfig”) – The auto class to register this new configuration with.

save_pretrained(save_directory: str | PathLike, **kwargs) None[source]

Overrides the transformers.PretrainedConfig.save_pretrained method to additionally save the tokens which should be masked during scoring.

Parameters:

save_directory (str | PathLike) – Directory to save the configuration
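
A save-and-reload sketch (the directory name is illustrative); saving also writes the mask-scoring tokens mentioned above:

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

# Persist the configuration, then load it back from the local directory.
config = BiEncoderConfig(query_mask_scoring_tokens="punctuation")
config.save_pretrained("bi-encoder-config")  # illustrative directory name
reloaded = BiEncoderConfig.from_pretrained("bi-encoder-config")
```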

to_added_args_dict() Dict[str, Any]

Outputs a dictionary of the added arguments.

Returns:

Added arguments

Return type:

Dict[str, Any]

to_dict() Dict[str, Any][source]

Overrides the transformers.PretrainedConfig.to_dict method to include the added arguments, the backbone model type, and remove the mask scoring tokens.

Returns:

Configuration dictionary

Return type:

Dict[str, Any]

to_diff_dict() Dict[str, Any]

Removes all attributes from config which correspond to the default config attributes for better readability and serializes to a Python dictionary.

Returns:

Dictionary of all the attributes that make up this configuration instance.

Return type:

Dict[str, Any]

to_json_file(json_file_path: str | PathLike, use_diff: bool = True)

Save this instance to a JSON file.

Parameters:
  • json_file_path (str or os.PathLike) – Path to the JSON file in which this configuration instance’s parameters will be saved.

  • use_diff (bool, optional, defaults to True) – If set to True, only the difference between the config instance and the default PretrainedConfig() is serialized to JSON file.

to_json_string(use_diff: bool = True) str

Serializes this instance to a JSON string.

Parameters:

use_diff (bool, optional, defaults to True) – If set to True, only the difference between the config instance and the default PretrainedConfig() is serialized to JSON string.

Returns:

String containing all the attributes that make up this configuration instance in JSON format.

Return type:

str

to_tokenizer_dict() Dict[str, Any]

Outputs a dictionary of the tokenizer arguments.

Returns:

Tokenizer arguments

Return type:

Dict[str, Any]
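
A sketch that inspects which arguments are forwarded to the tokenizer (see TOKENIZER_ARGS above); the exact contents of the returned dictionary may vary:

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

config = BiEncoderConfig(query_length=32, doc_length=256, add_marker_tokens=True)
# Only tokenizer-relevant arguments (query_length, doc_length, expansion flags,
# add_marker_tokens, ...) are expected in the returned dictionary.
print(config.to_tokenizer_dict())
```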

update(config_dict: Dict[str, Any])

Updates attributes of this class with attributes from config_dict.

Parameters:

config_dict (Dict[str, Any]) – Dictionary of attributes that should be updated for this class.

update_from_string(update_str: str)

Updates attributes of this class with attributes from update_str.

The expected format is ints, floats and strings as is, and for booleans use true or false. For example: “n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index”

The keys to change have to already exist in the config object.

Parameters:

update_str (str) – String with attributes that should be updated for this class.
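
A short sketch of the key=value string format; both keys already exist on the config, as required:

```python
from lightning_ir.bi_encoder.config import BiEncoderConfig

config = BiEncoderConfig()
# Integers are parsed as-is, booleans as "true"/"false".
config.update_from_string("query_length=64,normalize=true")
assert config.query_length == 64 and config.normalize is True
```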

property use_return_dict: bool

Whether or not return [~utils.ModelOutput] instead of tuples.

Type:

bool