hyrax.config_schemas.data_request

Pydantic models describing the structure of the data_request configuration.

These schemas validate and enforce the structure of dataset requests used throughout the Hyrax framework.

Attributes#

DatasetGroupValue

Classes#

`DataRequestConfig`	Per-dataset configuration used within `data_request`.
`DataRequestDefinition`	Typed representation of the full `data_request` table.

Functions#

`_normalize_dataset_group`(→ DatasetGroupValue)	Normalize a single dataset group value into a `dict[str, DataRequestConfig]`.
`_iter_all_configs`(→ list[tuple[str, DataRequestConfig]])	Yield `(group_name, config)` pairs across all groups.

Module Contents#

class DataRequestConfig(/, **data: Any)[source]#

Bases: hyrax.config_schemas.base.BaseConfigModel

Per-dataset configuration used within data_request.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

dataset_class: str = None[source]#

data_location: str = None[source]#

fields: list[str] | None = None[source]#

primary_id_field: str | None = None[source]#

join_field: str | None = None[source]#

dataset_config: dict | None = None[source]#

augment: bool | list[str] | None = None[source]#

classmethod resolve_data_location(v: str) → str[source]#: Fully resolve the data_location path, expanding user home directories and converting relative paths to absolute paths.
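The path handling described above can be approximated with the standard library. This is a minimal sketch, not Hyrax's implementation: `resolve_location` is a hypothetical stand-in, and whether the real validator also resolves symlinks (as `Path.resolve` does) is an assumption.

```python
from pathlib import Path


def resolve_location(v: str) -> str:
    """Hypothetical stand-in: expand '~' and convert the path to an absolute one."""
    # expanduser() replaces a leading '~' with the user's home directory;
    # resolve() then anchors relative paths at the current working directory.
    return str(Path(v).expanduser().resolve())


home_based = resolve_location("~/data")
relative = resolve_location("relative/dir")
```

Both results are absolute paths, so later code does not depend on the working directory at the time the config was written.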

join_field_excludes_primary() → DataRequestConfig[source]#: Ensure that join_field and primary_id_field are mutually exclusive.
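As an illustration of the mutual-exclusivity rule, the check below works on plain dicts instead of `DataRequestConfig`. The function name and error message are hypothetical; only the rule itself (both fields may not be set together) comes from the description above.

```python
def check_join_excludes_primary(config: dict) -> dict:
    """Hypothetical sketch: reject configs that set both join_field and primary_id_field."""
    has_join = config.get("join_field") is not None
    has_primary = config.get("primary_id_field") is not None
    if has_join and has_primary:
        raise ValueError("join_field and primary_id_field are mutually exclusive")
    # Return the config unchanged so the check can be chained, as a validator would.
    return config


ok = check_join_excludes_primary({"dataset_class": "HyraxRandomDataset", "join_field": "object_id"})
```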

validate_augment_list() → DataRequestConfig[source]#: Validate the list form of augment against fields and primary_id_field.

as_dict(*, exclude_unset: bool = False) → dict[str, Any][source]#: Return the configuration as a plain dictionary.

DatasetGroupValue[source]#

_normalize_dataset_group(value: Any) → DatasetGroupValue[source]#

Normalize a single dataset group value into a dict[str, DataRequestConfig].

Every dataset source within a group must be identified by a user-supplied friendly name. The friendly name is the key in the returned dict and is used by DataProvider to reference the dataset at runtime.

Accepted inputs#

A dict whose values are DataRequestConfig instances or plain dicts that can be validated as one. The keys become the friendly names.

Rejected inputs (raise `ValueError`)#

A flat dict that contains dataset_class at the top level (no friendly name wrapper).
A bare DataRequestConfig instance (no friendly name wrapper).
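The accepted and rejected shapes above can be sketched without Pydantic. Everything here is a hypothetical stand-in: `ConfigStandIn` replaces `DataRequestConfig` with only two fields, and the handling of non-dict inputs is an assumption, since the source does not describe it.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ConfigStandIn:
    """Hypothetical, reduced stand-in for DataRequestConfig."""
    dataset_class: str
    data_location: str


def normalize_group_sketch(value: Any) -> dict[str, ConfigStandIn]:
    # Rejected: a bare config instance has no friendly name.
    if isinstance(value, ConfigStandIn):
        raise ValueError("bare config: wrap it in a dict keyed by a friendly name")
    # Assumption: anything that is not a dict is rejected as well.
    if not isinstance(value, dict):
        raise ValueError("a dataset group must be a dict of friendly-named configs")
    # Rejected: a flat dict with dataset_class at the top level has no friendly name.
    if "dataset_class" in value:
        raise ValueError("flat config: wrap it in a dict keyed by a friendly name")
    # Accepted: keys are friendly names, values are configs or plain dicts.
    return {
        name: entry if isinstance(entry, ConfigStandIn) else ConfigStandIn(**entry)
        for name, entry in value.items()
    }


group = normalize_group_sketch(
    {"my_dataset": {"dataset_class": "HyraxRandomDataset", "data_location": "/path/to/data"}}
)
```

The friendly name (`my_dataset` here) survives as the dict key, which is what lets a consumer such as DataProvider refer to the dataset by name.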

_iter_all_configs(groups: dict[str, DatasetGroupValue]) → list[tuple[str, DataRequestConfig]][source]#: Yield (group_name, config) pairs across all groups.
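One plausible reading of this helper is a flattening of the two-level mapping, dropping the friendly names and keeping the group name. The sketch below assumes that reading and uses plain dicts in place of `DataRequestConfig`; `iter_all_configs_sketch` is a hypothetical name.

```python
def iter_all_configs_sketch(groups: dict[str, dict[str, dict]]) -> list[tuple[str, dict]]:
    """Hypothetical sketch: collect (group_name, config) for every config in every group."""
    return [
        (group_name, config)
        for group_name, group in groups.items()
        for config in group.values()
    ]


pairs = iter_all_configs_sketch(
    {
        "train": {"a": {"dataset_class": "HyraxRandomDataset"}},
        "infer": {
            "b": {"dataset_class": "HyraxRandomDataset"},
            "c": {"dataset_class": "HyraxRandomDataset"},
        },
    }
)
```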

class DataRequestDefinition[source]#

Bases: pydantic.RootModel[dict[str, DatasetGroupValue]]

Typed representation of the full data_request table.

Accepts any number of arbitrarily-named dataset groups (e.g. train, validate, infer, test, finetune, …). Each group value is a dict of friendly-named DataRequestConfig instances. A friendly name must always be provided explicitly; the schema raises a validation error if a dataset source is specified without one.

Example (Python):

{
    "train": {
        "my_dataset": {
            "dataset_class": "HyraxRandomDataset",
            "data_location": "/path/to/data",
            "primary_id_field": "object_id",
        }
    }
}

Example (TOML):

[data_request.train.my_dataset]
dataset_class = "HyraxRandomDataset"
data_location = "/path/to/data"
primary_id_field = "object_id"

classmethod normalize_all_groups(value: Any) → dict[str, DatasetGroupValue][source]#: Parse every top-level key into the expected group format.

reject_augment_on_infer() → DataRequestDefinition[source]#: Reject configurations that enable augmentation on the ‘infer’ data group.

require_at_least_one_dataset() → DataRequestDefinition[source]#: Ensure at least one dataset group is provided.

validate_primary_id_fields() → DataRequestDefinition[source]#

Validate that exactly one DataRequestConfig in each dataset group has a non-None primary_id_field.

This ensures that when multiple datasets are requested (e.g., a group contains a dict of multiple DataRequestConfig instances), exactly one of them specifies which field to use as the primary identifier.
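The "exactly one per group" rule can be illustrated on plain dicts. This is a sketch under stated assumptions: `check_primary_id_fields` and its error message are hypothetical, and the real validator operates on `DataRequestConfig` instances.

```python
def check_primary_id_fields(groups: dict[str, dict[str, dict]]) -> None:
    """Hypothetical sketch: each group needs exactly one config with a primary_id_field."""
    for group_name, group in groups.items():
        # Friendly names of the configs in this group that set primary_id_field.
        with_primary = [
            name for name, cfg in group.items()
            if cfg.get("primary_id_field") is not None
        ]
        if len(with_primary) != 1:
            raise ValueError(
                f"group {group_name!r}: expected exactly one primary_id_field, "
                f"found {len(with_primary)}"
            )


valid = {
    "train": {
        "images": {"dataset_class": "HyraxRandomDataset", "primary_id_field": "object_id"},
        "labels": {"dataset_class": "HyraxRandomDataset", "join_field": "object_id"},
    }
}
check_primary_id_fields(valid)  # passes: only "images" sets primary_id_field
```

A group with no primary identifier, or with two, fails the same check, which keeps the choice of identifier unambiguous when datasets are combined.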

validate_cross_group(groups: set[str]) → None[source]#: No-op: cross-group split validation is now handled by splitting_utils.validate_split_config.

__contains__(key: str) → bool[source]#: Return True if the group name is present in the definition.

__getitem__(key: str) → DatasetGroupValue[source]#: Return the dataset group value for the given group name.

as_dict(*, exclude_unset: bool = False) → dict[str, Any][source]#

Export as a nested dictionary compatible with existing configs.

Each group value is a dict of {friendly_name: flat_config_dict}. No implicit "data" wrapper is added; the friendly names supplied by the user are preserved verbatim.