RunDataset

class lightning_ir.data.dataset.RunDataset(run_path_or_id: Path | str, depth: int = -1, sample_size: int = -1, sampling_strategy: 'single_relevant' | 'top' | 'random' | 'log_random' | 'top_and_random' = 'top', targets: 'relevance' | 'subtopic_relevance' | 'rank' | 'score' | None = None, normalize_targets: bool = False, add_docs_not_in_ranking: bool = False)

Bases: _IRDataset, Dataset

__init__(run_path_or_id: Path | str, depth: int = -1, sample_size: int = -1, sampling_strategy: 'single_relevant' | 'top' | 'random' | 'log_random' | 'top_and_random' = 'top', targets: 'relevance' | 'subtopic_relevance' | 'rank' | 'score' | None = None, normalize_targets: bool = False, add_docs_not_in_ranking: bool = False) -> None

Dataset containing a list of queries with a ranked list of documents per query. Subsets of the ranked list can be sampled using different sampling strategies.

Parameters:
  • run_path_or_id (Path | str) – Path to a run file or valid ir_datasets id

  • depth (int, optional) – Depth at which to cut off the ranking. If -1, the full ranking is kept; defaults to -1

  • sample_size (int, optional) – The number of documents to sample per query, defaults to -1

  • sampling_strategy (Literal['single_relevant', 'top', 'random', 'log_random', 'top_and_random'], optional) – The strategy used to sample documents, defaults to “top”

  • targets (Literal['relevance', 'subtopic_relevance', 'rank', 'score'] | None, optional) – The data type to use as targets for a model during fine-tuning. If “relevance”, the relevance judgements are parsed from the qrels, defaults to None

  • normalize_targets (bool, optional) – Whether to normalize the targets between 0 and 1, defaults to False

  • add_docs_not_in_ranking (bool, optional) – Whether to add relevant documents to a sample if they are in the qrels but not in the ranking, defaults to False
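
The interaction between depth, sample_size and sampling_strategy can be sketched in plain Python. This is only an illustration under stated assumptions, not the lightning_ir implementation: the helper sample_ranking and its seed argument are hypothetical, only the “top” and “random” strategies are covered, and a sample_size of -1 is assumed to keep every document that remains after the depth cutoff.

```python
import random
from typing import List, Sequence


def sample_ranking(
    doc_ids: Sequence[str],
    depth: int = -1,
    sample_size: int = -1,
    sampling_strategy: str = "top",
    seed: int = 0,
) -> List[str]:
    """Hypothetical helper; illustrative only, not lightning_ir code."""
    # Cut the ranking off at `depth`; -1 keeps the full ranking.
    ranking = list(doc_ids) if depth == -1 else list(doc_ids[:depth])
    # Assumption: a sample_size of -1 keeps every document left after the cutoff.
    if sample_size == -1 or sample_size >= len(ranking):
        return ranking
    if sampling_strategy == "top":
        # Keep the highest-ranked documents.
        return ranking[:sample_size]
    if sampling_strategy == "random":
        # Draw uniformly without replacement from the truncated ranking.
        return random.Random(seed).sample(ranking, sample_size)
    raise ValueError(f"strategy not covered by this sketch: {sampling_strategy}")


ranked = ["d1", "d2", "d3", "d4", "d5"]
top_sample = sample_ranking(ranked, depth=4, sample_size=2)
random_sample = sample_ranking(
    ranked, depth=4, sample_size=2, sampling_strategy="random"
)
```

In this sketch the depth cutoff is applied before sampling, so a “random” sample can never contain a document ranked below depth. Whether RunDataset orders the two steps the same way is not stated above.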

Methods

__init__(run_path_or_id[, depth, ...])

Dataset containing a list of queries with a ranked list of documents per query.

Attributes

DASHED_DATASET_MAP

Map of dataset names with dashes to dataset names with slashes.

dataset

Dataset name.

dataset_id

Dataset id.

docs

Documents in the dataset.

docs_dataset_id

ID of the dataset containing the documents.

ir_dataset

Instance of ir_datasets.Dataset.

qrels

The qrels in the dataset.

queries

Queries in the dataset.

property DASHED_DATASET_MAP: Dict[str, str]

Map of dataset names with dashes to dataset names with slashes.

Returns:

Dataset map

Return type:

Dict[str, str]
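
A map of this shape could be built as in the following minimal sketch. The helper build_dashed_map is hypothetical and the two ids are used purely as examples; how lightning_ir actually populates the map is not described here. An explicit lookup is useful because dataset ids can themselves contain dashes, so a dashed name cannot be turned back into a slash-separated id by string replacement alone.

```python
from typing import Dict, Iterable


def build_dashed_map(dataset_ids: Iterable[str]) -> Dict[str, str]:
    """Hypothetical helper: map dashed names back to slash-separated ids."""
    # Replace every slash with a dash to form the key; keep the original id as value.
    return {dataset_id.replace("/", "-"): dataset_id for dataset_id in dataset_ids}


dashed_map = build_dashed_map(
    ["msmarco-passage/train", "msmarco-passage/trec-dl-2019/judged"]
)
# "msmarco-passage-train" resolves to "msmarco-passage/train"; replacing every
# dash with a slash would instead yield the incorrect "msmarco/passage/train".
resolved = dashed_map["msmarco-passage-train"]
```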

property dataset: str

Dataset name.

Returns:

Dataset name

Return type:

str

property dataset_id: str

Dataset id.

Returns:

Dataset id

Return type:

str

property docs: Docstore | Dict[str, GenericDoc]

Documents in the dataset.

Raises:

ValueError – If no documents are found in the dataset

Returns:

Documents

Return type:

ir_datasets.indices.Docstore | Dict[str, GenericDoc]

property docs_dataset_id: str

ID of the dataset containing the documents.

Returns:

Document dataset id

Return type:

str

property ir_dataset: Dataset | None

Instance of ir_datasets.Dataset.

Returns:

ir_datasets dataset

Return type:

ir_datasets.Dataset | None

property qrels: DataFrame | None

The qrels in the dataset. If the dataset does not contain qrels, the qrels are None.

Returns:

Qrels

Return type:

pd.DataFrame | None

property queries: Series

Queries in the dataset.

Raises:

ValueError – If no queries are found in the dataset

Returns:

Queries

Return type:

pd.Series