hyrax.datasets.mmu_dataset

Classes#

`_IndexedSubset`	Fallback wrapper to enforce a max row count for indexable datasets.
`MultimodalUniverseDataset`	Load a MultimodalUniverse dataset through Hugging Face `datasets`.

Module Contents#

class _IndexedSubset(dataset: Any, max_samples: int)[source]#

Fallback wrapper to enforce a max row count for indexable datasets.

_dataset[source]#

_max_samples[source]#

__getitem__(idx: int) → Any[source]#

__len__() → int[source]#
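The wrapper's behavior can be illustrated with a minimal sketch. This is not the Hyrax implementation: the class name `IndexedSubset`, the clamping to the wrapped dataset's length, and the bounds check are all assumptions made for illustration.

```python
from typing import Any, Sequence


class IndexedSubset:
    """Hypothetical stand-in: expose at most max_samples leading rows."""

    def __init__(self, dataset: Sequence[Any], max_samples: int) -> None:
        self._dataset = dataset
        # Assumption: never report more rows than the wrapped dataset has.
        self._max_samples = min(max_samples, len(dataset))

    def __getitem__(self, idx: int) -> Any:
        # Assumption: indices at or beyond the cap are rejected.
        # Negative indices are not supported in this sketch.
        if not 0 <= idx < self._max_samples:
            raise IndexError(f"index {idx} out of range for {self._max_samples} rows")
        return self._dataset[idx]

    def __len__(self) -> int:
        return self._max_samples


rows = ["a", "b", "c", "d", "e"]
subset = IndexedSubset(rows, max_samples=3)
print(len(subset), subset[2])
```

Because the cap is applied in `__len__` and `__getitem__` only, the wrapped dataset is never copied; any object that supports `len()` and integer indexing can be limited this way.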

class MultimodalUniverseDataset(config: dict, data_location: pathlib.Path | str | None = None)[source]#

Bases: hyrax.datasets.dataset_registry.HyraxDataset

Load a MultimodalUniverse dataset through Hugging Face datasets.

This dataset class is intentionally generic so one configuration pattern can be used for image, spectra, and time-series MMU datasets.

Examples

Example data_request configuration:

{
    "infer": {
        "mmu": {
            "dataset_class": "MultimodalUniverseDataset",
            "data_location": "hf://MultimodalUniverse/plasticc",
            "primary_id_field": "object_id",
            "dataset_config": {
                "MultimodalUniverseDataset": {
                    "split": "train",
                    "max_samples": 32,
                }
            },
        }
    }
}
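To make the nesting in this configuration easier to follow, the sketch below pulls the per-class options out of one `data_request` entry. The helper `read_mmu_settings` is hypothetical and not part of Hyrax, and the `"train"` fallback for `split` is an assumption; the class itself may resolve these values differently.

```python
from typing import Any

data_request = {
    "infer": {
        "mmu": {
            "dataset_class": "MultimodalUniverseDataset",
            "data_location": "hf://MultimodalUniverse/plasticc",
            "primary_id_field": "object_id",
            "dataset_config": {
                "MultimodalUniverseDataset": {
                    "split": "train",
                    "max_samples": 32,
                }
            },
        }
    }
}


def read_mmu_settings(entry: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: flatten one data_request entry into a single dict."""
    # Options live under dataset_config, keyed by the dataset class name.
    options = entry.get("dataset_config", {}).get(entry["dataset_class"], {})
    return {
        "data_location": entry["data_location"],
        "primary_id_field": entry.get("primary_id_field"),
        "split": options.get("split", "train"),  # assumed default
        "max_samples": options.get("max_samples"),  # None means no limit here
    }


settings = read_mmu_settings(data_request["infer"]["mmu"])
print(settings)
```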

__init__()[source]#

Overall initialization for all Datasets, which saves the config.

Subclasses of HyraxDataset should call this at the end of their __init__, like:

from hyrax.datasets import HyraxDataset

class MyDataset(HyraxDataset):
    def __init__(self, config):
        # <your code>
        super().__init__(config)

If per-tensor metadata is available, dataset authors are encouraged to create an astropy Table of that metadata, with rows in the same order as their data, and pass it as metadata_table as shown below:

from hyrax.datasets import HyraxDataset
from astropy.table import Table

class MyDataset(HyraxDataset):
    def __init__(self, config):
        # <your code>
        metadata_table = Table(<your catalog data goes here>)
        super().__init__(config, metadata_table)

Parameters:

config (dict, Optional) – The runtime configuration for hyrax
metadata_table (Optional[Table], optional) – An Astropy Table that (1) contains the metadata columns desired for visualization and (2) has its rows in the order your data will be enumerated.
object_id_column_name (Optional[str], optional) – The name of the column containing object IDs. If None, uses the default from config or creates one from the ids() method.

data_location = ''[source]#

split[source]#

max_samples[source]#

streaming[source]#

dataset[source]#

_column_name_map[source]#

_normalize_data_location(data_location: str) → str[source]#

_load_dataset(dataset_source: str)[source]#

_limit_non_streaming_dataset(dataset: Any, max_samples: int)[source]#

_build_column_name_map() → dict[str, str][source]#

Returns a map from sanitized column names to the original column names.

A column name may contain punctuation or start with a number. In these cases we also allow column access via a sanitized name, in which all punctuation is replaced with the underscore character and any name starting with a number is prefixed with field_.

Every field is entered in the dictionary, whether or not it needed sanitization. For a field that needs no sanitization, the sanitized name is identical to the field name.

_sanitize_name(column_name: str) → str[source]#

Take a column name that may contain punctuation and return a version with underscore replacing the punctuation.
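The sanitization and mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the Hyrax source: the function names are hypothetical, "punctuation" is taken to mean any character other than an ASCII letter, digit, or underscore, and a leading digit is assumed to be handled by prepending `field_`.

```python
import re


def sanitize_name(column_name: str) -> str:
    """Hypothetical sketch of turning a column name into a safe identifier."""
    # Assumption: anything that is not an ASCII letter, digit, or underscore
    # counts as punctuation and becomes an underscore.
    sanitized = re.sub(r"[^0-9A-Za-z_]", "_", column_name)
    # Assumption: names that start with a digit get a "field_" prefix.
    if sanitized and sanitized[0].isdigit():
        sanitized = f"field_{sanitized}"
    return sanitized


def build_column_name_map(column_names: list[str]) -> dict[str, str]:
    """Map every sanitized name back to its original column name."""
    # Names that need no change map to themselves.
    return {sanitize_name(name): name for name in column_names}


name_map = build_column_name_map(["object_id", "flux-err", "2mass.j"])
print(name_map)
```

Note that two different columns can sanitize to the same name (for example `a-b` and `a.b`); in this sketch the later column silently wins, so a real implementation may want to detect such collisions.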

_register_getters() → None[source]#

__len__() → int[source]#