Requesting data with Hyrax#

How to tell Hyrax what data to use.#

One of the core aspects of Hyrax is its ability to move data around in a reproducible way. To support this, the user must tell Hyrax what data is meant to be used and for what purpose.

This information is provided to Hyrax as a “data request”. In this notebook we explain how to construct a data request. Throughout this notebook we define data requests as Python dictionaries, as you would in a notebook workflow. They can just as easily be defined in a TOML config file.

Terminology#

A few terms are helpful to keep in mind while working through this notebook.

  • data request: The structured specification of what data Hyrax should use for a given task or verb.

  • data group: A grouping of datasets requested for a particular task (e.g. "train", "validate", "infer").

  • dataset: One specific source of data within a data group.

  • friendly name: The user-assigned name for a particular dataset within a data group.

Basic data request#

Here we define a minimal request for inference data.

Every dataset requires the following fields:

  • friendly_name (the dictionary key): A human-readable name for the dataset, e.g. "my_data". This key appears in every sample returned by the DataProvider, so choose a descriptive name.

  • dataset_class: The name of the dataset class to use. For built-in Hyrax datasets, use just the class name (e.g. "HyraxCifarDataset"). For external datasets, use the fully qualified import path (e.g. "my_package.my_module.MyClass").

  • data_location: The path to the data directory.

  • primary_id_field: The name of the dataset field that provides a unique identifier for each sample (e.g. "object_id").

[1]:
data_request = {
    "infer": {
        "my_data": {  # <- The friendly name
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./data",
            "primary_id_field": "object_id",
        }
    }
}
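
Before handing a request to Hyrax, it can help to sanity-check the dictionary yourself. The sketch below is a hypothetical helper, not part of Hyrax; the names find_request_problems and REQUIRED_PER_DATASET are invented for illustration. It checks the fields listed above, and assumes (following the multi-source section of this notebook) that one dataset per group declaring primary_id_field is sufficient.

```python
# Hypothetical helper, not part of Hyrax: checks a data request dictionary
# against the required fields described in this notebook.
REQUIRED_PER_DATASET = ("dataset_class", "data_location")


def find_request_problems(data_request):
    """Return a list of problem descriptions; an empty list means none were found."""
    problems = []
    for group_name, datasets in data_request.items():
        for friendly_name, spec in datasets.items():
            for key in REQUIRED_PER_DATASET:
                if key not in spec:
                    problems.append(f"{group_name}.{friendly_name}: missing '{key}'")
        # Assumption: one dataset per group declaring primary_id_field is enough.
        if not any("primary_id_field" in spec for spec in datasets.values()):
            problems.append(f"{group_name}: no dataset declares 'primary_id_field'")
    return problems


complete_request = {
    "infer": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./data",
            "primary_id_field": "object_id",
        }
    }
}
incomplete_request = {"infer": {"my_data": {"dataset_class": "HyraxCifarDataset"}}}

print(find_request_problems(complete_request))
for problem in find_request_problems(incomplete_request):
    print(problem)
```

Hyrax performs its own validation when the configuration is resolved; a check like this only catches obvious omissions earlier.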

Setting the data request#

Once the data request is defined, it must be added to the Hyrax runtime configuration. Use the set_config() method to do this, so that the configuration is fully resolved.

Some Hyrax verbs require specific data group names. For example, the infer verb requires a group named "infer", and the train verb requires a group named "train".

[2]:
from hyrax import Hyrax

h = Hyrax()
h.set_config("data_request", data_request)
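
The group-name requirement can be checked with a few lines of plain Python before a verb is run. This is a minimal sketch under stated assumptions: REQUIRED_GROUPS and missing_groups are hypothetical names, and the mapping only covers the two verbs whose requirements are stated above.

```python
# Group names each verb requires, as stated in this notebook. Other verbs are
# left out because their requirements are not described here.
REQUIRED_GROUPS = {
    "infer": {"infer"},
    "train": {"train"},
}


def missing_groups(verb, data_request):
    """Return the required group names that the data request does not define."""
    return sorted(REQUIRED_GROUPS[verb] - set(data_request))


infer_only_request = {
    "infer": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./data",
            "primary_id_field": "object_id",
        }
    }
}

print(missing_groups("infer", infer_only_request))
print(missing_groups("train", infer_only_request))
```

An empty list means the request defines every group the verb needs; a non-empty list names the groups still to be added.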

Optional data groups with train and validate#

The train verb requires a "train" data group and optionally accepts a "validate" data group. When the "validate" data group is present, a validation epoch is run after each training epoch.

Note that the data_location values differ between the two groups below. This is how Hyrax supports data splits defined as separate sets of files.

[3]:
data_request = {
    "train": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./train_data",
            "primary_id_field": "object_id",
        }
    },
    "validate": {
        "my_data": {  # <- Same friendly name as in "train"
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./validate_data",  # <- Different location for the validation split
            "primary_id_field": "object_id",
        }
    },
}

Defining a split fraction#

Hyrax also supports dataset splits defined as fractions of a single directory. This is useful when all files live together and it would be impractical to move them into separate directories.

All active groups sharing the same data_location must declare split_fraction, and their values must sum to ≤ 1.0. Active data groups are the groups the current verb actually processes — when running train, that means "train" and "validate" (if present).

Fractions may sum to less than 1.0; using a small fraction is a common way to speed up iteration during early experimentation.

[4]:
data_request = {
    "train": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./all_data",
            "primary_id_field": "object_id",
            "split_fraction": 0.8,  # <- Optionally specify a split fraction to split the data into train/validate
        }
    },
    "validate": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./all_data",
            "primary_id_field": "object_id",
            "split_fraction": 0.2,  # <- The split fractions for all active groups sharing a data_location must sum to <= 1.0
        }
    },
}
[5]:
data_request = {
    "train": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./all_data",
            "primary_id_field": "object_id",
            "split_fraction": 0.01,
        }
    },
    "validate": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./all_data",
            "primary_id_field": "object_id",
            "split_fraction": 0.002,  # <- train + validate fractions sum to 0.012, well under 1.0
        }
    },
}
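
The rule for split fractions can be expressed as a small check. The sketch below is not Hyrax's implementation; check_split_fractions is a hypothetical helper written from the description above. It assumes that a data_location where no active group uses split_fraction is exempt from the rule, and it allows a small tolerance for floating-point rounding.

```python
from collections import defaultdict


def check_split_fractions(data_request, active_groups):
    """Raise ValueError if the split-fraction rule described above is violated."""
    by_location = defaultdict(list)
    for group_name in active_groups:
        for friendly_name, spec in data_request.get(group_name, {}).items():
            by_location[spec["data_location"]].append(
                (f"{group_name}.{friendly_name}", spec.get("split_fraction"))
            )

    for location, entries in by_location.items():
        fractions = [fraction for _, fraction in entries]
        if all(fraction is None for fraction in fractions):
            continue  # Assumption: locations that never use split_fraction are exempt.
        undeclared = [name for name, fraction in entries if fraction is None]
        if undeclared:
            raise ValueError(f"{location}: split_fraction missing for {undeclared}")
        total = sum(fractions)
        if total > 1.0 + 1e-9:  # small tolerance for floating-point rounding
            raise ValueError(f"{location}: split fractions sum to {total:.3f} > 1.0")


def make_request(train_fraction, validate_fraction):
    """Build a train/validate request over one shared directory (illustrative only)."""
    base = {
        "dataset_class": "HyraxCifarDataset",
        "data_location": "./all_data",
        "primary_id_field": "object_id",
    }
    return {
        "train": {"my_data": {**base, "split_fraction": train_fraction}},
        "validate": {"my_data": {**base, "split_fraction": validate_fraction}},
    }


ok_request = make_request(0.8, 0.2)
bad_request = make_request(0.8, 0.3)

check_split_fractions(ok_request, ["train", "validate"])  # passes silently
try:
    check_split_fractions(bad_request, ["train", "validate"])
except ValueError as error:
    print(f"rejected: {error}")
```

Passing the active groups explicitly mirrors the point made above: only the groups the current verb processes take part in the sum.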

Requesting specific fields#

Dataset classes in Hyrax expose data via field getter methods. A field is the smallest unit of data for a given sample, such as a label or an image. Some dataset classes expose many fields — for example, HyraxCSVDataset presents each CSV column as a separate field.

By default, a data request will consume all fields a dataset class exposes. Use the fields parameter to request only the fields you need.

Note: primary_id_field is always fetched so that Hyrax can track and save results. However, it is passed to models only if it also appears in the fields list.

[6]:
data_request = {
    "train": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./all_data",
            "primary_id_field": "object_id",
            "fields": ["image", "label"],
        }
    }
}
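
The field-selection rules can be illustrated with plain Python. This is only a sketch of the behaviour described above, not Hyrax code: fields_for_models and fields_fetched are hypothetical helpers, and exposed_fields is an assumed list of the fields a dataset class might expose.

```python
def fields_for_models(spec, exposed_fields):
    """Fields handed to the model: the requested subset, or everything by default."""
    return list(spec.get("fields", exposed_fields))


def fields_fetched(spec, exposed_fields):
    """Fields read from the dataset: the model fields plus the primary ID, if any."""
    fetched = fields_for_models(spec, exposed_fields)
    primary_id = spec.get("primary_id_field")
    if primary_id is not None and primary_id not in fetched:
        fetched.append(primary_id)
    return fetched


# Assumed for illustration; the real list depends on the dataset class.
exposed_fields = ["image", "label", "object_id"]

spec = {
    "dataset_class": "HyraxCifarDataset",
    "data_location": "./all_data",
    "primary_id_field": "object_id",
    "fields": ["image", "label"],
}

print(fields_for_models(spec, exposed_fields))
print(fields_fetched(spec, exposed_fields))
```

With this request, the identifier is read so that results can be tracked, but the model only sees the two requested fields.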

Using multiple data sources#

The friendly name exists so that data from multiple sources can be combined in a single data group. Hyrax joins them at runtime — no pre-processing step or on-disk merge is required.

Important: When using multiple datasets, Hyrax requires them to be index-aligned. This means that for a given index i, each dataset must return data corresponding to the same object. For example, if your two datasets hold images and spectra respectively, images[i] and spectra[i] must refer to the same object.

Note that primary_id_field is only required on one “primary” dataset. Hyrax uses that dataset’s identifiers when storing results.

[7]:
data_request = {
    "train": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./all_data",
            "primary_id_field": "object_id",  # <- Now tagged as the primary dataset
            "fields": ["image", "label"],
        },
        "random_data": {
            "dataset_class": "HyraxRandomDataset",
            "data_location": "./data",
            "fields": ["image"],
        },
    }
}
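
Index alignment is easy to get wrong, so it can be worth checking before training. The sketch below assumes that you can obtain a list of object identifiers from each source in index order; the lists and the helper first_misaligned_index are hypothetical and not part of Hyrax. Following the best-practices note at the end of this notebook, a non-primary source may be longer than the primary one but not shorter.

```python
def first_misaligned_index(primary_ids, other_ids):
    """Return the first index where the sources disagree, or None if aligned.

    A missing entry in other_ids (other source too short) counts as a mismatch.
    Extra trailing entries in other_ids are allowed.
    """
    for i, primary_id in enumerate(primary_ids):
        if i >= len(other_ids) or other_ids[i] != primary_id:
            return i
    return None


# Hypothetical identifier lists, in the order each dataset would return them.
image_ids = ["obj_a", "obj_b", "obj_c"]
spectra_ids_longer = ["obj_a", "obj_b", "obj_c", "obj_d"]
spectra_ids_shuffled = ["obj_a", "obj_c", "obj_b"]
spectra_ids_short = ["obj_a", "obj_b"]

print(first_misaligned_index(image_ids, spectra_ids_longer))
print(first_misaligned_index(image_ids, spectra_ids_shuffled))
print(first_misaligned_index(image_ids, spectra_ids_short))
```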

Per-source dataset configuration#

Dataset class parameters are normally set once in a config file before a run. However, when the same dataset class is used for two data sources that require different parameters, you can provide those overrides inline using the dataset_config key.

In the example below, HyraxCifarDataset is used for both training and validation. For validation we need to set use_training_data = False, which would conflict with the training configuration if set globally — so we scope it to just the "validate" data source.

[8]:
data_request = {
    "train": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./data",
            "primary_id_field": "object_id",
        }
    },
    "validate": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./data",
            "primary_id_field": "object_id",
            "dataset_config": {  # <- Pass specific configurations just for this dataset
                "HyraxCifarDataset": {  # <- This matches the dataset_class value
                    "use_training_data": False,
                    # Additional dataset-specific configurations can be passed here
                }
            },
        }
    },
}

Requesting an external dataset class#

Hyrax ships with a small number of built-in dataset classes. For classes defined in an external package, provide the fully qualified import path as the dataset_class value. This is the dotted path you would use in an import statement.

For example, "my_package.datasets.my_dataset.MyDatasetClass" corresponds to:

from my_package.datasets.my_dataset import MyDatasetClass
[9]:
data_request = {
    "train": {
        "my_external_data": {
            "dataset_class": "my_package.datasets.my_dataset.MyDatasetClass",  # <- dotted path
            "data_location": "./data",
            "primary_id_field": "object_id",
        }
    }
}
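
To make the link between a dotted path and an import statement concrete, here is a minimal sketch of how such a path can be resolved with the standard library. It is an illustration only; how Hyrax resolves dataset_class internally is not shown in this notebook, and load_class is a hypothetical helper. A standard-library class stands in for the external dataset class so the example runs anywhere.

```python
import collections
import importlib


def load_class(dotted_path):
    """Import the module part of a dotted path and return the named attribute."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"{dotted_path!r} is not a fully qualified path")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# "collections.OrderedDict" plays the role of
# "my_package.datasets.my_dataset.MyDatasetClass".
loaded = load_class("collections.OrderedDict")
print(loaded is collections.OrderedDict)
```

If the package is not installed in the active environment, importlib.import_module raises ModuleNotFoundError, which is a quick way to confirm the path is importable before using it in a data request.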

Dataset config for external dataset classes#

External dataset classes should define their default configurations in a default_config.toml file. For the hypothetical class above, the defaults might look like:

[my_package.MyDatasetClass]
parameter_a = true
parameter_b = 100

We can override default values inside a data request by mirroring the same nesting under dataset_config.

Note that the TOML structure is not required to match the package structure: the location of the class in the package does not have to match the nested tables in the TOML file.

my_package.datasets.my_dataset.MyDatasetClass (package) != my_package.MyDatasetClass (toml)
[10]:
data_request = {
    "train": {
        "my_external_data": {
            "dataset_class": "my_package.datasets.my_dataset.MyDatasetClass",
            "data_location": "./data",
            "primary_id_field": "object_id",
            "dataset_config": {
                "my_package": {  # <- Note the nesting matches table nesting in the toml file
                    "MyDatasetClass": {
                        "parameter_a": False,
                        "parameter_b": 42,
                    }
                }
            },
        }
    }
}
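
One way to picture the effect of dataset_config is as a recursive merge of the overrides onto the defaults. The sketch below is an assumption about the outcome, not a description of Hyrax's internals; merge_overrides is a hypothetical helper, and the defaults dictionary mirrors the example default_config.toml above.

```python
import copy


def merge_overrides(defaults, overrides):
    """Return a new dict with overrides applied on top of defaults, table by table."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Same nesting as the [my_package.MyDatasetClass] table in the example defaults.
defaults = {"my_package": {"MyDatasetClass": {"parameter_a": True, "parameter_b": 100}}}

# Override only one parameter; the other keeps its default value.
overrides = {"my_package": {"MyDatasetClass": {"parameter_b": 42}}}

effective = merge_overrides(defaults, overrides)
print(effective["my_package"]["MyDatasetClass"])
```

Because the merge follows the nesting key by key, the override must use the same table names as the TOML defaults, which is why the dataset_config nesting mirrors the TOML file rather than the package layout.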

Converting to TOML#

When Hyrax runs a verb, the data request is serialized to TOML and saved alongside the output in runtime_config.toml. The equivalent TOML for the data request defined in the next cell is shown below.

[11]:
data_request = {
    "train": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./data",
            "primary_id_field": "object_id",
        }
    },
    "validate": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./data",
            "primary_id_field": "object_id",
            "dataset_config": {
                "HyraxCifarDataset": {
                    "use_training_data": False,
                }
            },
        }
    },
}

The TOML representation of the data request defined above:

[data_request.train.my_data]
dataset_class = "HyraxCifarDataset"
data_location = "./data"
primary_id_field = "object_id"

[data_request.validate.my_data]
dataset_class = "HyraxCifarDataset"
data_location = "./data"
primary_id_field = "object_id"

[data_request.validate.my_data.dataset_config.HyraxCifarDataset]
use_training_data = false
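
The equivalence between the two forms can be checked by parsing the TOML text and comparing it with the dictionary. This sketch assumes Python 3.11 or later for the standard-library tomllib module, with the third-party tomli package as a fallback on older versions; it does not use Hyrax itself.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API

toml_text = """
[data_request.train.my_data]
dataset_class = "HyraxCifarDataset"
data_location = "./data"
primary_id_field = "object_id"

[data_request.validate.my_data]
dataset_class = "HyraxCifarDataset"
data_location = "./data"
primary_id_field = "object_id"

[data_request.validate.my_data.dataset_config.HyraxCifarDataset]
use_training_data = false
"""

expected_request = {
    "train": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./data",
            "primary_id_field": "object_id",
        }
    },
    "validate": {
        "my_data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./data",
            "primary_id_field": "object_id",
            "dataset_config": {"HyraxCifarDataset": {"use_training_data": False}},
        }
    },
}

parsed = tomllib.loads(toml_text)
print(parsed["data_request"] == expected_request)
```

Each level of dictionary nesting becomes one segment of a dotted TOML table name, and Python's False becomes TOML's false.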

Best practices#

Use the same friendly name for the same data across different data groups.#

For instance, if a source of spectra data is used for training, validation, and inference, give it the same friendly name, such as “spectra_data”, in each of the “train”, “validate”, and “infer” groups.

Ensure that multi-modal data requests have datasets that are index-aligned.#

Here, index-aligned means that for a given index i, each dataset must return data corresponding to the same object. For example, if two datasets hold images and spectra respectively, images[i] and spectra[i] must refer to the same object. This doesn’t necessarily mean that the datasets must be exactly the same size. The non-primary datasets can be larger than the primary dataset, but they should not be smaller.