Providing Data - Level 1

How to tell Hyrax what data to give to a model.

Every model needs data. It learns from data during training, and it makes predictions from data during inference. In Hyrax, that flow of information happens through two main pieces: a HyraxDataset and a DataProvider. A HyraxDataset is the code that knows how to read specific data from disk. A DataProvider is the part we actually ask for data — it calls on one or more datasets, retrieves the fields we need, and hands everything back as a clean, well-structured Python dictionary.
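To make that division of labor concrete, here is a toy sketch in plain Python. It is not Hyrax's implementation: `ToyDataset` and `ToyProvider` are hypothetical names invented for this illustration. The dataset knows how to produce one sample by index, and the provider gathers samples from each named dataset into a nested dictionary.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset: produces one sample per index."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        # A real dataset would read this sample from disk.
        return {"value": self._values[index], "object_id": str(index)}


class ToyProvider:
    """Hypothetical stand-in for a provider: collects samples from named datasets."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def __len__(self):
        # Assumption for this sketch: only indices valid in every dataset count.
        return min(len(dataset) for dataset in self._datasets.values())

    def __getitem__(self, index):
        # The top-level key is the name each dataset was registered under.
        return {name: dataset[index] for name, dataset in self._datasets.items()}


provider = ToyProvider({"data": ToyDataset([10.0, 20.0, 30.0])})
sample = provider[1]
print(sample)  # {'data': {'value': 20.0, 'object_id': '1'}}
```

The nested shape of `sample` mirrors the kind of dictionary a DataProvider hands back: one top-level key per dataset, with that dataset's fields underneath.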

In this guide, we’ll take our very first steps with data in Hyrax. Here’s what we’ll do:

  • Learn how to use a DataProvider to tell Hyrax what data a model should see.

  • Look inside the DataProvider to understand what the data looks like once it’s ready.

To keep things simple, we’ll practice with a built-in Dataset called HyraxRandomDataset. Think of it as “practice data” that stands in for the real thing.

As always, our first move will be to create an instance of the Hyrax class.

[1]:
from hyrax import Hyrax

h = Hyrax()

Next, we’ll tell Hyrax that we want to use HyraxRandomDataset as the source for our data provider.

[2]:
model_inputs_definition = {
    "train": {
        "data": {"dataset_class": "HyraxRandomDataset"},
    }
}
h.set_config("model_inputs", model_inputs_definition)

# Prepare "model_inputs"
d = h.prepare()

# Print a sample of the data
d["train"].sample_data()
[2025-12-18 18:09:49,383 hyrax.config_utils:WARNING] Cannot find default_config.toml for torch.nn.CrossEntropyLoss.
[2025-12-18 18:09:49,384 hyrax.config_utils:WARNING] Cannot find default_config.toml for torch.optim.SGD.
[2025-12-18 18:10:00,595 hyrax.config_utils:WARNING] Cannot find default_config.toml for umap.UMAP.
[2025-12-18 18:10:00,603 hyrax.config_utils:WARNING] Runtime config contains key or section 'train' which has no default defined. All configuration keys and sections must be defined in /home/docs/checkouts/readthedocs.org/user_builds/hyrax/envs/v0.6.8/lib/python3.11/site-packages/hyrax/hyrax_default_config.toml
[2025-12-18 18:10:04,164 hyrax.prepare:INFO] Finished Prepare
[2]:
{'data': {'image': array([[[0.08925092, 0.773956  , 0.6545715 , 0.43887842, 0.43301523],
          [0.8585979 , 0.08594561, 0.697368  , 0.20146948, 0.09417731],
          [0.52647895, 0.9756223 , 0.73575234, 0.7611397 , 0.71747726],
          [0.78606427, 0.51322657, 0.12811363, 0.8397482 , 0.45038593],
          [0.5003519 , 0.370798  , 0.1825496 , 0.92676497, 0.78156745]],

         [[0.6438651 , 0.40241432, 0.8227616 , 0.5454291 , 0.44341415],
          [0.45045954, 0.22723871, 0.09213591, 0.55458474, 0.8878898 ],
          [0.0638172 , 0.85829127, 0.8276311 , 0.27675968, 0.6316644 ],
          [0.16522902, 0.7580877 , 0.70052296, 0.35452592, 0.06791997],
          [0.970698  , 0.44568747, 0.89312106, 0.677919  , 0.7783835 ]]],
        dtype=float32),
  'label': np.int64(0),
  'meta_field_1': np.float64(50.0),
  'meta_field_2': np.float64(33.333333333333336),
  'object_id': '19'}}

Hooray — success! 🎉 We just told Hyrax which Dataset to use, called h.prepare() to set it up, and printed out a sample of the data. The configuration we used is called model_inputs, because it defines what will be sent into the model. (And yes, the name is plural — models can take more than one input, but we’ll save that for the next notebook.)

What we’ve created here is the simplest possible setup: it tells Hyrax to use HyraxRandomDataset as the source of data for training. That’s all we need to get started!

Before we move on, there are two details worth highlighting:

  • The dictionary key (“data”) is up to you. You can name it whatever you like, as long as each key is unique.

  • The value of “dataset_class” is the Dataset you want Hyrax to use. Here we picked HyraxRandomDataset, but you could swap in any other Dataset class.
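The layout of the definition can be read off with a few lines of plain Python. The sketch below walks the same dictionary and prints one line per entry; `describe_model_inputs` is a hypothetical helper written for this illustration, not part of Hyrax.

```python
model_inputs_definition = {
    "train": {
        "data": {"dataset_class": "HyraxRandomDataset"},
    }
}


def describe_model_inputs(definition):
    """Return one readable line per (group, friendly name) entry."""
    lines = []
    for group_name, group in definition.items():
        for friendly_name, settings in group.items():
            lines.append(f"{group_name}/{friendly_name} -> {settings['dataset_class']}")
    return lines


summary = describe_model_inputs(model_inputs_definition)
print("\n".join(summary))  # train/data -> HyraxRandomDataset
```

The outer key ("train") names the group, the inner key ("data") is the friendly name you chose, and “dataset_class” names the Dataset behind it.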

Examine some of the data

Now that our DataProvider is set up, let’s take a closer look at the data itself.

We’ve already seen the DataProvider.sample_data() function, which returns the first sample it can find. Because Hyrax retrieves data by index, it’s also easy to explore different parts of the dataset by sampling at random indices. This helps you get a feel for the form and structure of the data before feeding it into a model.

[3]:
d["train"][12]
[3]:
{'data': {'image': array([[[0.6162841 , 0.77899635, 0.73904294, 0.13455218, 0.8260549 ],
          [0.536068  , 0.3230834 , 0.51422286, 0.96698344, 0.85757214],
          [0.83369845, 0.4627993 , 0.7841378 , 0.38508946, 0.68616545],
          [0.63956326, 0.24979866, 0.26646328, 0.06113309, 0.13976836],
          [0.4336186 , 0.47787726, 0.16512072, 0.4168893 , 0.6041775 ]],

         [[0.23256993, 0.33823687, 0.3675118 , 0.5050539 , 0.36639243],
          [0.7369446 , 0.32749552, 0.43389672, 0.37946403, 0.18291575],
          [0.68574333, 0.23544616, 0.29687643, 0.7927519 , 0.9488579 ],
          [0.6708431 , 0.916348  , 0.8541906 , 0.48091042, 0.77029014],
          [0.32836115, 0.46049702, 0.5354348 , 0.96049297, 0.84856045]]],
        dtype=float32),
  'label': np.int64(2),
  'meta_field_1': np.float64(44.0),
  'meta_field_2': np.float64(29.333333333333332),
  'object_id': '31'}}

No big surprise: the output here looks very similar to what we saw with d["train"].sample_data(). Remember, the data is returned as a nested dictionary, with the top-level key matching the friendly name, “data”.
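Because the result is an ordinary nested dictionary, individual fields can be pulled out with standard indexing. The sketch below uses a hand-built stand-in for the sample at index 12, with the image shortened to a small nested list, so it runs without Hyrax installed.

```python
# Stand-in for the sample at index 12; the image is abbreviated to a
# tiny nested list in place of the real float32 array.
sample = {
    "data": {
        "image": [[[0.6, 0.7], [0.5, 0.3]]],
        "label": 2,
        "meta_field_1": 44.0,
        "object_id": "31",
    }
}

fields = sample["data"]       # everything under the friendly name
label = fields["label"]       # one field of that dataset
field_names = sorted(fields)  # which fields are available

print(label)        # 2
print(field_names)  # ['image', 'label', 'meta_field_1', 'object_id']
```

With the real provider, the same pattern would apply to the dictionary it returns for that index.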

You can use any integer index from 0 up to, but not including, the size of the dataset. How do we figure out that size? Simple: just use Python’s len(...) function! This lets us see exactly how many samples are available.

[4]:
len(d["train"])
[4]:
100
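Knowing the size makes it easy to pick random valid indices to inspect. The following is a minimal sketch using only the standard library; `random_indices` is a hypothetical helper, and the size is hard-coded to the value reported above so the snippet runs on its own.

```python
import random


def random_indices(dataset_size, count, seed=None):
    """Pick `count` indices in the valid range 0 .. dataset_size - 1."""
    rng = random.Random(seed)
    return [rng.randrange(dataset_size) for _ in range(count)]


dataset_size = 100  # the value len(d["train"]) reported above
indices = random_indices(dataset_size, count=5, seed=0)
print(indices)

# In the notebook, each index could then be used to look at a sample:
# for i in indices:
#     d["train"][i]
```

Passing a seed keeps the selection repeatable between runs, which helps when comparing notes with a colleague.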

The primary id field

One more important option is primary_id_field — it tells Hyrax which value to include with each sample so you can trace inference outputs back to the original input.

Each dataset can provide a unique identifier per sample (a name, an index, or any value). In this dataset the identifier is returned by get_object_id, so request it by setting “primary_id_field” to “object_id”:

[5]:
model_inputs_definition = {
    "train": {
        "data": {
            "dataset_class": "HyraxRandomDataset",
            "primary_id_field": "object_id",
        },
    },
}

h.set_config("model_inputs", model_inputs_definition)

# Prepare "model_inputs"
d = h.prepare()

# Print a sample of the data
d["train"].sample_data()
[2025-12-18 18:10:04,189 hyrax.config_utils:WARNING] Cannot find default_config.toml for torch.nn.CrossEntropyLoss.
[2025-12-18 18:10:04,190 hyrax.config_utils:WARNING] Cannot find default_config.toml for torch.optim.SGD.
[2025-12-18 18:10:04,191 hyrax.config_utils:WARNING] Cannot find default_config.toml for umap.UMAP.
[2025-12-18 18:10:04,198 hyrax.config_utils:WARNING] Runtime config contains key or section 'primary_id_field' which has no default defined. All configuration keys and sections must be defined in /home/docs/checkouts/readthedocs.org/user_builds/hyrax/envs/v0.6.8/lib/python3.11/site-packages/hyrax/hyrax_default_config.toml
[2025-12-18 18:10:04,233 hyrax.prepare:INFO] Finished Prepare
[5]:
{'data': {'image': array([[[0.08925092, 0.773956  , 0.6545715 , 0.43887842, 0.43301523],
          [0.8585979 , 0.08594561, 0.697368  , 0.20146948, 0.09417731],
          [0.52647895, 0.9756223 , 0.73575234, 0.7611397 , 0.71747726],
          [0.78606427, 0.51322657, 0.12811363, 0.8397482 , 0.45038593],
          [0.5003519 , 0.370798  , 0.1825496 , 0.92676497, 0.78156745]],

         [[0.6438651 , 0.40241432, 0.8227616 , 0.5454291 , 0.44341415],
          [0.45045954, 0.22723871, 0.09213591, 0.55458474, 0.8878898 ],
          [0.0638172 , 0.85829127, 0.8276311 , 0.27675968, 0.6316644 ],
          [0.16522902, 0.7580877 , 0.70052296, 0.35452592, 0.06791997],
          [0.970698  , 0.44568747, 0.89312106, 0.677919  , 0.7783835 ]]],
        dtype=float32),
  'label': np.int64(0),
  'meta_field_1': np.float64(50.0),
  'meta_field_2': np.float64(33.333333333333336),
  'object_id': '19'},
 'object_id': '19'}

With that change the model input is fully specified; training or inference can now be run using the built-in Hyrax models (for instance h.train() or h.infer()). We’ll leave that as an exercise — training and inference are covered in more detail in a later notebook.
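To see why a per-sample identifier matters, consider the toy sketch below. It keys made-up outputs by the top-level “object_id” so that each result can be matched to its input. `fake_model` and the two abbreviated samples are hypothetical stand-ins, not Hyrax output.

```python
# Abbreviated stand-ins for two samples, shaped like the output above:
# the identifier appears at the top level next to the "data" entry.
samples = [
    {"data": {"label": 0}, "object_id": "19"},
    {"data": {"label": 2}, "object_id": "31"},
]


def fake_model(sample):
    """Hypothetical model: derives a made-up score from the label."""
    return sample["data"]["label"] * 0.5


# Key each result by the primary id so it can be traced back to its input.
results_by_id = {sample["object_id"]: fake_model(sample) for sample in samples}
print(results_by_id)  # {'19': 0.0, '31': 1.0}
```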

Persisting the configuration

In Hyrax, all configurations — including model_inputs — are saved to the configuration .toml file, along with any results. This makes it easy to reuse or share your setup later.

For our example, the saved configuration would look like this:

[model_inputs]
[model_inputs.train]
[model_inputs.train.data]
dataset_class = 'HyraxRandomDataset'
primary_id_field = 'object_id'

This ensures that Hyrax remembers exactly which Dataset and fields you want for future runs.

Recap

Great job! We covered a lot of ground in this notebook and learned the basics of providing data to models in Hyrax. Here’s a quick summary of what you accomplished:

  • Learned how to use DataProvider to supply data for your models.

  • Configured which dataset to use by updating the “model_inputs” configuration.

  • Previewed sample data and checked the dataset size to understand your data better.

  • Specified a primary ID field for traceability in your results.

  • Saw how Hyrax saves your configuration for easy reuse and sharing.

You’re now ready to set up data for training or inference with Hyrax! If you want more control over the datasets used for training and inference, “model_inputs” accepts additional configuration options. Check out Providing Data - Level 2 for more.