Providing Data - Level 2
Fine grained control over requested data
In this notebook we’ll explore a few more options you can put in model_inputs to control exactly what data your model receives.
We’ll continue with the built-in HyraxRandomDataset from Level 1 to demonstrate passing dataset-specific options via dataset_config. After that we’ll switch to HyraxCifarDataSet to show how to use data_location and fields so you can request only the pieces you need.
As always, start by creating an instance of the Hyrax class, then prepare the same dataset that was used at the end of Level 1.
[1]:
from hyrax import Hyrax

h = Hyrax()

model_inputs_definition = {
    "train": {
        "data": {
            "dataset_class": "HyraxRandomDataset",
            "primary_id_field": "object_id",
        },
    },
}

h.set_config("model_inputs", model_inputs_definition)

# Prepare "model_inputs"
d = h.prepare()

# Print a sample of the data
d["train"].sample_data()
[2025-11-12 22:23:44,609 hyrax.config_utils:WARNING] Cannot find default_config.toml for torch.nn.CrossEntropyLoss.
[2025-11-12 22:23:44,610 hyrax.config_utils:WARNING] Cannot find default_config.toml for torch.optim.SGD.
[2025-11-12 22:23:50,165 hyrax.config_utils:WARNING] Cannot find default_config.toml for umap.UMAP.
[2025-11-12 22:23:50,178 hyrax.config_utils:WARNING] Runtime config contains key or section 'train' which has no default defined. All configuration keys and sections must be defined in /home/docs/checkouts/readthedocs.org/user_builds/hyrax/envs/v0.6.7/lib/python3.10/site-packages/hyrax/hyrax_default_config.toml
[2025-11-12 22:23:52,978 hyrax.prepare:INFO] Finished Prepare
[1]:
{'data': {'image': array([[[0.08925092, 0.773956 , 0.6545715 , 0.43887842, 0.43301523],
[0.8585979 , 0.08594561, 0.697368 , 0.20146948, 0.09417731],
[0.52647895, 0.9756223 , 0.73575234, 0.7611397 , 0.71747726],
[0.78606427, 0.51322657, 0.12811363, 0.8397482 , 0.45038593],
[0.5003519 , 0.370798 , 0.1825496 , 0.92676497, 0.78156745]],
[[0.6438651 , 0.40241432, 0.8227616 , 0.5454291 , 0.44341415],
[0.45045954, 0.22723871, 0.09213591, 0.55458474, 0.8878898 ],
[0.0638172 , 0.85829127, 0.8276311 , 0.27675968, 0.6316644 ],
[0.16522902, 0.7580877 , 0.70052296, 0.35452592, 0.06791997],
[0.970698 , 0.44568747, 0.89312106, 0.677919 , 0.7783835 ]]],
dtype=float32),
'label': np.int64(0),
'meta_field_1': np.float64(50.0),
'meta_field_2': np.float64(33.333333333333336),
'object_id': '19'},
'object_id': '19'}
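The sample printed above is long, so it can help to reduce it to a short summary of what each field contains. The helper below is a minimal sketch and not part of Hyrax: summarize_sample and FakeArray are hypothetical names introduced here, and FakeArray only stands in for the NumPy array so the example runs without dependencies.

```python
from typing import Any


def summarize_sample(sample: dict[str, Any], prefix: str = "") -> dict[str, str]:
    """Map each leaf's dotted path to a short description of its type and shape."""
    summary: dict[str, str] = {}
    for key, value in sample.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested groups such as sample["data"].
            summary.update(summarize_sample(value, prefix=f"{path}."))
        else:
            description = type(value).__name__
            shape = getattr(value, "shape", None)
            if shape is not None:
                description += f" shape={tuple(shape)}"
            summary[path] = description
    return summary


class FakeArray:
    """Hypothetical stand-in for a NumPy array; only the shape attribute matters."""

    shape = (2, 5, 5)


example = {
    "data": {"image": FakeArray(), "label": 0, "object_id": "19"},
    "object_id": "19",
}

for path, description in summarize_sample(example).items():
    print(f"{path}: {description}")
```

In a live session you could try passing the dictionary returned by d["train"].sample_data() instead of the hand-built example.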
Dataset-specific configurations
HyraxRandomDataset is flexible and has several configuration settings. Let’s see how to set the shape of the returned “image” using “model_inputs”.
[2]:
model_inputs_definition = {
    "train": {
        "data": {
            "dataset_class": "HyraxRandomDataset",
            "primary_id_field": "object_id",
            "dataset_config": {
                "shape": (3, 2, 4),  # <- Change the shape
            },
        },
    }
}
h.set_config("model_inputs", model_inputs_definition)
d = h.prepare()
sample = d["train"].sample_data()
print(f"Image shape: {sample['data']['image'].shape}")
[2025-11-12 22:23:52,992 hyrax.config_utils:WARNING] Cannot find default_config.toml for torch.nn.CrossEntropyLoss.
[2025-11-12 22:23:52,994 hyrax.config_utils:WARNING] Cannot find default_config.toml for torch.optim.SGD.
[2025-11-12 22:23:52,996 hyrax.config_utils:WARNING] Cannot find default_config.toml for umap.UMAP.
[2025-11-12 22:23:53,006 hyrax.config_utils:WARNING] Runtime config contains key or section 'dataset_config' which has no default defined. All configuration keys and sections must be defined in /home/docs/checkouts/readthedocs.org/user_builds/hyrax/envs/v0.6.7/lib/python3.10/site-packages/hyrax/hyrax_default_config.toml
[2025-11-12 22:23:53,065 hyrax.prepare:INFO] Finished Prepare
Image shape: (3, 2, 4)
Any other dataset configuration parameter can be set in the "dataset_config" dictionary. While this may seem redundant (since you can set these same values elsewhere in the config), its real power shows in Providing Data - Level 3, when we request data from multiple datasets at once.
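Since every cell in this notebook repeats the same nested dictionary with small variations, a small builder function can make the structure easier to see. This is only a sketch: build_model_inputs is a hypothetical helper written for this example, not a Hyrax function, and it assumes the "train" and "data" nesting used in the cells above.

```python
from typing import Any


def build_model_inputs(
    dataset_class: str,
    primary_id_field: str,
    group: str = "train",
    name: str = "data",
    **options: Any,
) -> dict[str, Any]:
    """Return a model_inputs dictionary with the nesting used in this notebook."""
    entry: dict[str, Any] = {
        "dataset_class": dataset_class,
        "primary_id_field": primary_id_field,
    }
    # Optional keys such as dataset_config are added only when supplied.
    entry.update(options)
    return {group: {name: entry}}


model_inputs_definition = build_model_inputs(
    "HyraxRandomDataset",
    "object_id",
    dataset_config={"shape": (3, 2, 4)},
)
print(model_inputs_definition)
```

The resulting dictionary has the same content as the one in the cell above, so it could be passed to h.set_config("model_inputs", model_inputs_definition) in the same way.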
Defining the data location
So far, the built-in HyraxRandomDataset has been used as a lightweight stand-in for real data. It lives only in memory and is great for quick experiments. Now we’ll switch to HyraxCifarDataSet and learn how to tell Hyrax where the data lives using the “data_location” parameter. This lets Hyrax load examples from disk instead of generating them on the fly, just as you will in real projects.
Note - We’ll need to download the CIFAR dataset for this; it’s about 170 MB.
[3]:
model_inputs_definition = {
    "train": {
        "data": {
            "dataset_class": "HyraxCifarDataSet",
            "data_location": "./data",  # <- Define where to find the data
            "primary_id_field": "object_id",
        },
    },
}
h.set_config("model_inputs", model_inputs_definition)
d = h.prepare()
d["train"].sample_data()
[2025-11-12 22:23:53,074 hyrax.config_utils:WARNING] Cannot find default_config.toml for torch.nn.CrossEntropyLoss.
[2025-11-12 22:23:53,075 hyrax.config_utils:WARNING] Cannot find default_config.toml for torch.optim.SGD.
[2025-11-12 22:23:53,077 hyrax.config_utils:WARNING] Cannot find default_config.toml for umap.UMAP.
[2025-11-12 22:23:53,088 hyrax.config_utils:WARNING] Runtime config contains key or section 'data_location' which has no default defined. All configuration keys and sections must be defined in /home/docs/checkouts/readthedocs.org/user_builds/hyrax/envs/v0.6.7/lib/python3.10/site-packages/hyrax/hyrax_default_config.toml
100%|██████████| 170M/170M [00:12<00:00, 13.2MB/s]
[2025-11-12 22:24:17,366 hyrax.prepare:INFO] Finished Prepare
[3]:
{'data': {'image': array([[[-0.5372549 , -0.6627451 , -0.60784316, ..., 0.23921573,
0.19215691, 0.16078436],
[-0.8745098 , -1. , -0.85882354, ..., -0.03529412,
-0.06666666, -0.04313725],
[-0.8039216 , -0.8745098 , -0.6156863 , ..., -0.0745098 ,
-0.05882353, -0.14509803],
...,
[ 0.6313726 , 0.5764706 , 0.5529412 , ..., 0.254902 ,
-0.56078434, -0.58431375],
[ 0.41176474, 0.35686278, 0.45882356, ..., 0.4431373 ,
-0.23921567, -0.3490196 ],
[ 0.38823533, 0.3176471 , 0.4039216 , ..., 0.69411767,
0.18431377, -0.03529412]],
[[-0.5137255 , -0.6392157 , -0.62352943, ..., 0.03529418,
-0.01960784, -0.02745098],
[-0.84313726, -1. , -0.9372549 , ..., -0.3098039 ,
-0.3490196 , -0.31764704],
[-0.8117647 , -0.94509804, -0.7882353 , ..., -0.34117645,
-0.34117645, -0.42745095],
...,
[ 0.33333337, 0.20000005, 0.26274514, ..., 0.04313731,
-0.75686276, -0.73333335],
[ 0.09019613, -0.03529412, 0.12941182, ..., 0.16078436,
-0.5137255 , -0.58431375],
[ 0.12941182, 0.01176476, 0.11372554, ..., 0.4431373 ,
-0.0745098 , -0.27843136]],
[[-0.5058824 , -0.64705884, -0.6627451 , ..., -0.15294117,
-0.19999999, -0.19215685],
[-0.84313726, -1. , -1. , ..., -0.5686275 ,
-0.60784316, -0.5529412 ],
[-0.8352941 , -1. , -0.9372549 , ..., -0.60784316,
-0.60784316, -0.67058825],
...,
[-0.24705881, -0.73333335, -0.79607844, ..., -0.45098037,
-0.94509804, -0.84313726],
[-0.24705881, -0.67058825, -0.7647059 , ..., -0.26274508,
-0.73333335, -0.73333335],
[-0.09019607, -0.26274508, -0.31764704, ..., 0.09803927,
-0.34117645, -0.4352941 ]]], shape=(3, 32, 32), dtype=float32),
'index': 0,
'label': 6,
'object_id': 0},
'object_id': 0}
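The cell above uses the relative path "./data". Python's file APIs resolve a relative path against the current working directory, so the same notebook can end up looking in different places depending on where it is launched. One way to guard against that is to resolve the path up front. The sketch below uses only the standard library; resolve_data_location is a hypothetical helper, and whether Hyrax creates a missing directory itself is not stated here, so the create flag is just a precaution.

```python
import tempfile
from pathlib import Path


def resolve_data_location(location: str, create: bool = False) -> str:
    """Return an absolute version of `location`, optionally creating the directory."""
    path = Path(location).expanduser().resolve()
    if create:
        path.mkdir(parents=True, exist_ok=True)
    if not path.is_dir():
        raise FileNotFoundError(f"Data directory does not exist: {path}")
    return str(path)


# Demonstrate with a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    data_location = resolve_data_location(f"{tmp}/data", create=True)
    print(data_location)
```

The returned string could then be used as the "data_location" value in model_inputs.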
If you only want some fields
In the minimal setup, Hyrax grabs all the fields that a dataset can provide. That’s handy for a quick demo, but in real projects you usually don’t want everything. For example, you might only need images and labels, while ignoring extra metadata.
Fortunately, you’re in control. The DataProvider can show you exactly which fields are available, and the model_inputs configuration lets you pick and choose the ones you actually want.
[4]:
d["train"].fields()
[4]:
{'data': ['image', 'index', 'label', 'object_id']}
Note that fields() always returns every field the dataset exposes, regardless of which fields you request in model_inputs.
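Because fields() lists everything a dataset exposes, its output can be used to catch a misspelled field name before it goes into model_inputs. The check below is a minimal sketch: check_requested_fields is a hypothetical helper, and the available_fields dictionary is copied by hand from the fields() output shown above.

```python
def check_requested_fields(
    available: dict[str, list[str]],
    name: str,
    requested: list[str],
) -> list[str]:
    """Return `requested` unchanged, or raise if any field is not available."""
    known = available.get(name, [])
    missing = [field for field in requested if field not in known]
    if missing:
        raise ValueError(
            f"Fields {missing} are not provided by '{name}'. Available: {known}"
        )
    return requested


# Copied from the fields() output shown above.
available_fields = {"data": ["image", "index", "label", "object_id"]}

fields = check_requested_fields(available_fields, "data", ["label"])
print(fields)
```

In a live session, d["train"].fields() could be passed in place of the hand-copied dictionary.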
[5]:
model_inputs_definition = {
    "train": {
        "data": {
            "dataset_class": "HyraxCifarDataSet",
            "data_location": "./data",  # <- Define where to find the data
            "primary_id_field": "object_id",
            "fields": ["label"],  # <- Request only this specific field.
        },
    },
}
h.set_config("model_inputs", model_inputs_definition)
d = h.prepare()
d["train"].sample_data()
[2025-11-12 22:24:17,386 hyrax.config_utils:WARNING] Cannot find default_config.toml for torch.nn.CrossEntropyLoss.
[2025-11-12 22:24:17,387 hyrax.config_utils:WARNING] Cannot find default_config.toml for torch.optim.SGD.
[2025-11-12 22:24:17,388 hyrax.config_utils:WARNING] Cannot find default_config.toml for umap.UMAP.
[2025-11-12 22:24:17,399 hyrax.config_utils:WARNING] Runtime config contains key or section 'fields' which has no default defined. All configuration keys and sections must be defined in /home/docs/checkouts/readthedocs.org/user_builds/hyrax/envs/v0.6.7/lib/python3.10/site-packages/hyrax/hyrax_default_config.toml
[2025-11-12 22:24:27,325 hyrax.prepare:INFO] Finished Prepare
[5]:
{'data': {'label': 6}, 'object_id': 0}
Huzzah! 🎉 We just looked at all the fields that HyraxCifarDataSet can provide, and then updated model_inputs to request only “label”. Now our DataProvider returns exactly the fields we want — no extras.
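To make the effect of “fields” concrete, here is a plain-Python sketch of the same filtering applied to a hand-written sample. It illustrates the observed behavior only and is not how Hyrax implements it; select_fields is a hypothetical helper and the "<array>" string is a placeholder for the image data. Note that the top-level object_id stays in place, matching the output above.

```python
from typing import Any


def select_fields(
    sample: dict[str, Any],
    name: str,
    fields: list[str],
) -> dict[str, Any]:
    """Keep only `fields` inside sample[name]; leave top-level keys untouched."""
    filtered = dict(sample)
    filtered[name] = {key: sample[name][key] for key in fields}
    return filtered


full_sample = {
    "data": {"image": "<array>", "index": 0, "label": 6, "object_id": 0},
    "object_id": 0,
}

print(select_fields(full_sample, "data", ["label"]))
# prints {'data': {'label': 6}, 'object_id': 0}
```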
Recap
In this notebook, we explored how to control the data provided to your models using Hyrax’s flexible configuration system. You learned how to specify exactly what data you want, where it comes from, and which fields are included, giving you fine-grained control over your data pipeline. Specifically, we:
Used HyraxRandomDataset and set dataset-specific options via dataset_config
Switched to HyraxCifarDataSet and specified the data location with data_location
Inspected available fields with fields()
Selected only the fields needed for your workflow using the fields parameter in model_inputs
Next, work through the “Providing Data - Level 3” notebook to learn how to use model_inputs and DataProvider to specify dataset splits, provide data for inference and request multimodal data for your models!