hyrax.datasets.mmu_dataset
Classes
_IndexedSubset | Fallback wrapper to enforce a max row count for indexable datasets.
MultimodalUniverseDataset | Load a MultimodalUniverse dataset through Hugging Face datasets.
Module Contents
- class _IndexedSubset(dataset: Any, max_samples: int)
Fallback wrapper to enforce a max row count for indexable datasets.
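The wrapper's implementation is not shown on this page. As a minimal sketch of the idea, assuming the wrapped dataset supports `len()` and integer indexing, a row cap could look like the following. The class name `IndexedSubsetSketch` is hypothetical and is not the hyrax source.

```python
from typing import Any


class IndexedSubsetSketch:
    """Hypothetical stand-in for _IndexedSubset; not the hyrax source."""

    def __init__(self, dataset: Any, max_samples: int):
        if max_samples < 0:
            raise ValueError("max_samples must be non-negative")
        self._dataset = dataset
        # Never report more rows than the wrapped dataset actually has.
        self._length = min(len(dataset), max_samples)

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, index: int) -> Any:
        # Interpret negative indices relative to the capped length.
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("index out of range for capped dataset")
        return self._dataset[index]


subset = IndexedSubsetSketch(list(range(100)), max_samples=32)
print(len(subset), subset[0], subset[-1])  # prints "32 0 31"
```

Raising `IndexError` past the cap also lets plain `for` loops over the wrapper stop at `max_samples` rows.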
- class MultimodalUniverseDataset(config: dict, data_location: pathlib.Path | str | None = None)
Bases: hyrax.datasets.dataset_registry.HyraxDataset
Load a MultimodalUniverse dataset through Hugging Face datasets.
This dataset class is intentionally generic so one configuration pattern can be used for image, spectra, and time-series MMU datasets.
Examples
Example data_request configuration:

    {
        "infer": {
            "mmu": {
                "dataset_class": "MultimodalUniverseDataset",
                "data_location": "hf://MultimodalUniverse/plasticc",
                "primary_id_field": "object_id",
                "dataset_config": {
                    "MultimodalUniverseDataset": {
                        "split": "train",
                        "max_samples": 32,
                    }
                },
            }
        }
    }
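To make the nesting of that configuration easier to follow, here is a small sketch that pulls the individual settings back out of the dictionary. The helper `read_mmu_settings` is hypothetical and not part of hyrax. Treating the text after `hf://` as a Hugging Face repository id is an assumption based only on the example value.

```python
data_request = {
    "infer": {
        "mmu": {
            "dataset_class": "MultimodalUniverseDataset",
            "data_location": "hf://MultimodalUniverse/plasticc",
            "primary_id_field": "object_id",
            "dataset_config": {
                "MultimodalUniverseDataset": {
                    "split": "train",
                    "max_samples": 32,
                }
            },
        }
    }
}


def read_mmu_settings(request: dict, stage: str = "infer", name: str = "mmu") -> dict:
    """Hypothetical helper: flatten one data_request entry into plain settings."""
    entry = request[stage][name]
    location = entry["data_location"]
    prefix = "hf://"
    # Assumption: the part after "hf://" names the Hugging Face repository.
    repo_id = location[len(prefix):] if location.startswith(prefix) else location
    # Per-class options are keyed by the dataset class name.
    options = entry.get("dataset_config", {}).get(entry["dataset_class"], {})
    return {
        "repo_id": repo_id,
        "primary_id_field": entry.get("primary_id_field"),
        "split": options.get("split"),
        "max_samples": options.get("max_samples"),
    }


settings = read_mmu_settings(data_request)
print(settings)
```

Note that the per-class options sit under a key that repeats the `dataset_class` value, so the two names have to match.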
Overall initialization for all Datasets, which saves the config.
Subclasses of HyraxDataset ought to call this at the end of their __init__ like:

    from hyrax.datasets import HyraxDataset

    class MyDataset(HyraxDataset):
        def __init__(self, config):
            <your code>
            super().__init__(config)
If per-tensor metadata is available, it is recommended that dataset authors create an astropy Table of that data, in the same order as their data, and pass that metadata_table as shown below:

    from hyrax.datasets import HyraxDataset
    from astropy.table import Table

    class MyDataset(HyraxDataset):
        def __init__(self, config):
            <your code>
            metadata_table = Table(<Your catalog data goes here>)
            super().__init__(config, metadata_table)
- Parameters:
config (dict, Optional) – The runtime configuration for hyrax
metadata_table (Optional[Table], optional) – An Astropy Table that (1) contains the metadata columns desired for visualization and (2) has its rows in the same order in which your data will be enumerated.
object_id_column_name (Optional[str], optional) – The name of the column containing object IDs. If None, uses the default from config or creates one from the ids() method.
- _build_column_name_map() -> dict[str, str]
Returns a map from sanitized column names to the original column names.
It’s possible for a column name to have punctuation or start with a number. In these cases we also allow column access via a sanitized name where all punctuation is replaced with the underscore character, and any field starting with a number is replaced by field_.
Every field is entered in the dictionary regardless of whether it needed sanitization. For a field that needed none, the sanitized name is exactly the field name.
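The sanitization rule above can be sketched as follows. This is an illustration under stated assumptions, not the hyrax implementation: the function name `build_column_name_map` is hypothetical, "punctuation" is taken to mean Python's `string.punctuation`, and a leading digit is handled by prepending `field_` to the name.

```python
import string


def build_column_name_map(column_names):
    """Hypothetical sketch: map sanitized column names to the originals."""
    # Translate every ASCII punctuation character to an underscore.
    table = str.maketrans(string.punctuation, "_" * len(string.punctuation))
    name_map = {}
    for original in column_names:
        sanitized = original.translate(table)
        # Assumption: names starting with a digit get a "field_" prefix.
        if sanitized[:1].isdigit():
            sanitized = "field_" + sanitized
        # Names that needed no change map to themselves.
        name_map[sanitized] = original
    return name_map


columns = ["object_id", "flux-err", "2mass.j"]
print(build_column_name_map(columns))
```

In this sketch, two original names that sanitize to the same key would overwrite one another. How the real method handles such collisions is not stated here.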