Providing Data - Level 1

How to tell Hyrax what data to give to a model.

Every model needs data. It learns from data during training, and it makes predictions from data during inference. In Hyrax, that flow of information happens through two main pieces: a HyraxDataset and a DataProvider. A HyraxDataset is the code that knows how to read specific data from disk. A DataProvider is the part we actually ask for data — it calls on one or more datasets, retrieves the fields we need, and hands everything back as a clean, well-structured Python dictionary.

In this guide, we’ll take our very first steps with data in Hyrax. Here’s what we’ll do:

Learn how to use a DataProvider to tell Hyrax what data a model should see.
Look inside the DataProvider to understand what the data looks like once it’s ready.

To keep things simple, we’ll practice with a built-in Dataset called HyraxRandomDataset. Think of it as “practice data” that stands in for the real thing.

As always, our first move will be to create an instance of the Hyrax class.

[9]:

from hyrax import Hyrax

h = Hyrax()

[2025-09-15 15:38:05,894 hyrax:INFO] Runtime Config read from: /Users/drew/code/hyrax/src/hyrax/hyrax_default_config.toml

Next we’ll try to tell Hyrax that we want to use the HyraxRandomDataset as the source for our data provider.

[10]:

h.config["model_inputs"] = {
    "data": {
        "dataset_class": "HyraxRandomDataset",
    },
}

# Prepare the model_inputs
d = h.prepare()

# Print a sample of the data
d.sample_data()

[2025-09-15 15:38:07,037 hyrax.data_sets.data_provider:INFO] No fields were specified for 'data'. The request will be modified to select all by default. You can specify `fields` in `model_inputs`.
[2025-09-15 15:38:07,301 hyrax.prepare:INFO] Finished Prepare

[10]:

{'data': {'image': array([[[0.08925092, 0.773956  , 0.6545715 , 0.43887842, 0.43301523],
          [0.8585979 , 0.08594561, 0.697368  , 0.20146948, 0.09417731],
          [0.52647895, 0.9756223 , 0.73575234, 0.7611397 , 0.71747726],
          [0.78606427, 0.51322657, 0.12811363, 0.8397482 , 0.45038593],
          [0.5003519 , 0.370798  , 0.1825496 , 0.92676497, 0.78156745]],

         [[0.6438651 , 0.40241432, 0.8227616 , 0.5454291 , 0.44341415],
          [0.45045954, 0.22723871, 0.09213591, 0.55458474, 0.8878898 ],
          [0.0638172 , 0.85829127, 0.8276311 , 0.27675968, 0.6316644 ],
          [0.16522902, 0.7580877 , 0.70052296, 0.35452592, 0.06791997],
          [0.970698  , 0.44568747, 0.89312106, 0.677919  , 0.7783835 ]]],
        dtype=float32),
  'label': np.int64(0),
  'meta_field_1': np.float64(50.0),
  'meta_field_2': np.float64(33.333333333333336),
  'object_id': '19'}}

Hooray, success! 🎉 We just told Hyrax which Dataset to use, called h.prepare() to set it up, and printed out a sample of the data. The configuration we used is called model_inputs, because it defines what will be sent into the model. (And yes, the name is plural: models can take more than one input, but we’ll save that for the next notebook.)

What we’ve created here is the simplest possible setup: it tells Hyrax to use HyraxRandomDataset as the source of data for both training and inference. That’s all we need to get started.

Before we move on, there are two details worth highlighting:

The dictionary key (“data”) is up to you. You can name it whatever you like, as long as each key is unique.
The value of “dataset_class” is the Dataset you want Hyrax to use. Here we picked HyraxRandomDataset, but you could swap in any other Dataset class.
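
To make the first detail concrete, here is a minimal sketch of the same configuration under a different friendly name. The name "practice_images" is made up for illustration, and the Hyrax calls appear only as comments so that the snippet runs without Hyrax installed.

```python
# Same configuration as before, but with a different friendly name.
# "practice_images" is a hypothetical name chosen for this sketch.
model_inputs = {
    "practice_images": {
        "dataset_class": "HyraxRandomDataset",
    },
}

# With Hyrax available, this would be applied exactly as in the cell above:
#   h.config["model_inputs"] = model_inputs
#   d = h.prepare()

# List each friendly name next to the Dataset it points at.
for friendly_name, spec in model_inputs.items():
    print(f"{friendly_name!r} -> {spec['dataset_class']}")
```

Since the returned sample is keyed by the friendly name, we would expect the top-level key of the output to change from “data” to “practice_images” as well.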

If you only want some fields

In the minimal setup, Hyrax grabs all the fields that a Dataset can provide. That’s handy for a quick demo, but in real projects you usually don’t want everything at once. For example, you might only need images and labels, while ignoring extra metadata.

Fortunately, you’re in control. The DataProvider can show you exactly which fields are available, and the model_inputs configuration lets you pick and choose the ones you actually want.

[11]:

d.fields()

[11]:

{'data': ['image', 'label', 'meta_field_1', 'meta_field_2', 'object_id']}

[12]:

h.config["model_inputs"] = {
    "data": {
        "dataset_class": "HyraxRandomDataset",
        "fields": ["image", "meta_field_2"],  # <- Request only specific fields.
    },
}

# Prepare the model_inputs
d = h.prepare()

# Print a sample of the data
d.sample_data()

[2025-09-15 15:49:15,867 hyrax.prepare:INFO] Finished Prepare

[12]:

{'data': {'image': array([[[0.08925092, 0.773956  , 0.6545715 , 0.43887842, 0.43301523],
          [0.8585979 , 0.08594561, 0.697368  , 0.20146948, 0.09417731],
          [0.52647895, 0.9756223 , 0.73575234, 0.7611397 , 0.71747726],
          [0.78606427, 0.51322657, 0.12811363, 0.8397482 , 0.45038593],
          [0.5003519 , 0.370798  , 0.1825496 , 0.92676497, 0.78156745]],

         [[0.6438651 , 0.40241432, 0.8227616 , 0.5454291 , 0.44341415],
          [0.45045954, 0.22723871, 0.09213591, 0.55458474, 0.8878898 ],
          [0.0638172 , 0.85829127, 0.8276311 , 0.27675968, 0.6316644 ],
          [0.16522902, 0.7580877 , 0.70052296, 0.35452592, 0.06791997],
          [0.970698  , 0.44568747, 0.89312106, 0.677919  , 0.7783835 ]]],
        dtype=float32),
  'meta_field_2': np.float64(33.333333333333336)}}

Huzzah! 🎉 We just looked at all the fields that HyraxRandomDataset provides, and then updated model_inputs to request only “image” and “meta_field_2”. Now our DataProvider returns exactly the fields we want — no extras.
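
Conceptually, requesting fields behaves like filtering a dictionary down to the keys you name. The sketch below illustrates that idea in plain Python. It is not Hyrax’s implementation: select_fields is a hypothetical helper, and the sample values are stand-ins, with a short list in place of the real image array.

```python
def select_fields(sample, fields):
    """Keep only the requested fields, in the order they were requested."""
    missing = [name for name in fields if name not in sample]
    if missing:
        raise KeyError(f"Unknown field(s): {missing}")
    return {name: sample[name] for name in fields}


# Stand-in sample using the field names reported by d.fields().
full_sample = {
    "image": [0.1, 0.2, 0.3],
    "label": 0,
    "meta_field_1": 50.0,
    "meta_field_2": 33.3,
    "object_id": "19",
}

trimmed = select_fields(full_sample, ["image", "meta_field_2"])
print(list(trimmed))  # ['image', 'meta_field_2']
```

Raising an error for an unknown field name is a design choice of this sketch; how Hyrax itself reacts to a misspelled field is not covered here.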

Examine some of the data

Now that our DataProvider is set up to return only the fields we care about, let’s take a closer look at the data itself.

We’ve already seen the DataProvider.sample_data() function, which returns the first sample it can find. Because Hyrax retrieves data by index, it’s also easy to explore different parts of the dataset by sampling at random indices. This helps you get a feel for the form and structure of the data before feeding it into a model.

[13]:

d[12]

[13]:

{'data': {'image': array([[[0.6162841 , 0.77899635, 0.73904294, 0.13455218, 0.8260549 ],
          [0.536068  , 0.3230834 , 0.51422286, 0.96698344, 0.85757214],
          [0.83369845, 0.4627993 , 0.7841378 , 0.38508946, 0.68616545],
          [0.63956326, 0.24979866, 0.26646328, 0.06113309, 0.13976836],
          [0.4336186 , 0.47787726, 0.16512072, 0.4168893 , 0.6041775 ]],

         [[0.23256993, 0.33823687, 0.3675118 , 0.5050539 , 0.36639243],
          [0.7369446 , 0.32749552, 0.43389672, 0.37946403, 0.18291575],
          [0.68574333, 0.23544616, 0.29687643, 0.7927519 , 0.9488579 ],
          [0.6708431 , 0.916348  , 0.8541906 , 0.48091042, 0.77029014],
          [0.32836115, 0.46049702, 0.5354348 , 0.96049297, 0.84856045]]],
        dtype=float32),
  'meta_field_2': np.float64(29.333333333333332)}}

No big surprise — the output here looks very similar to what we saw with d.sample_data(). Remember, the data is returned as a nested dictionary, with the top-level key matching the friendly name, “data”.
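
Printed arrays get long quickly, so it can help to summarize a sample instead of dumping it. The helper below is a small sketch that walks the nested dictionary and reports each field’s type, plus its shape when the value has one. The name describe is our own, and the sample is a stand-in so the snippet runs on its own.

```python
def describe(sample, indent=0):
    """Return one line per field: its name, type, and shape if it has one."""
    lines = []
    pad = " " * indent
    for key, value in sample.items():
        if isinstance(value, dict):
            # Nested dictionary: print the key, then recurse one level deeper.
            lines.append(f"{pad}{key}:")
            lines.extend(describe(value, indent + 2))
        else:
            detail = type(value).__name__
            shape = getattr(value, "shape", None)  # present on array-like values
            if shape is not None:
                detail += f", shape={tuple(shape)}"
            lines.append(f"{pad}{key}: {detail}")
    return lines


# Stand-in sample with the same nesting as the DataProvider output.
sample = {"data": {"image": [[0.6, 0.7], [0.5, 0.3]], "meta_field_2": 29.33}}
print("\n".join(describe(sample)))
```

With a real provider, describe(d[12]) should also report a shape for “image”, assuming that value is a NumPy array as in the output above.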

You can use any integer index from 0 up to one less than the size of the dataset. But how do we figure out that size? Simple: just use Python’s len(...) function! This lets us see exactly how many samples are available.

[14]:

len(d)

[14]:
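
Combining len(...) with integer indexing gives a simple way to spot-check samples at random positions. The sketch below assumes only those two operations. FakeProvider is a stand-in class written for this example, and random_samples is a hypothetical helper, not part of Hyrax.

```python
import random


def random_samples(provider, count, seed=None):
    """Draw up to `count` samples at distinct random indices."""
    rng = random.Random(seed)
    size = len(provider)
    indices = rng.sample(range(size), k=min(count, size))
    return {index: provider[index] for index in indices}


class FakeProvider:
    """Stand-in that supports only len() and integer indexing."""

    def __init__(self, size):
        self._size = size

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError(index)
        return {"data": {"meta_field_2": index / 3}}


picked = random_samples(FakeProvider(100), count=3, seed=0)
for index, sample in picked.items():
    print(index, sample)
```

With the real provider from this notebook, random_samples(d, count=3) should work the same way, since d supports both len(d) and d[index].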

Persisting the configuration

In Hyrax, all configurations — including model_inputs — are saved to the configuration .toml file, along with any results. This makes it easy to reuse or share your setup later.

For our example, the saved configuration would look like this:

[model_inputs]
[model_inputs.data]
dataset_class = 'HyraxRandomDataset'
fields = ['image', 'meta_field_2']

This ensures that Hyrax remembers exactly which Dataset and fields you want for future runs.