hyrax.datasets.mmu_dataset
==========================

.. py:module:: hyrax.datasets.mmu_dataset


Classes
-------

.. autoapisummary::

   hyrax.datasets.mmu_dataset._IndexedSubset
   hyrax.datasets.mmu_dataset.MultimodalUniverseDataset


Module Contents
---------------

.. py:class:: _IndexedSubset(dataset: Any, max_samples: int)

   Fallback wrapper to enforce a max row count for indexable datasets.


   .. py:attribute:: _dataset


   .. py:attribute:: _max_samples


   .. py:method:: __getitem__(idx: int) -> Any


   .. py:method:: __len__() -> int
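
   The wrapper described above can be sketched roughly as follows. This is an
   illustration only, assuming the wrapped dataset supports ``len()`` and integer
   indexing. The class name ``IndexedSubsetSketch`` and the bounds-checking
   behavior are assumptions, not the actual implementation.

   .. code-block:: python

       from typing import Any


       class IndexedSubsetSketch:
           """Expose at most ``max_samples`` rows of an indexable dataset."""

           def __init__(self, dataset: Any, max_samples: int):
               self._dataset = dataset
               # Assumption: never report more rows than the wrapped dataset has.
               self._max_samples = min(max_samples, len(dataset))

           def __getitem__(self, idx: int) -> Any:
               # Assumption: indices outside the capped range are rejected.
               if idx < 0 or idx >= self._max_samples:
                   raise IndexError(idx)
               return self._dataset[idx]

           def __len__(self) -> int:
               return self._max_samples

   Wrapping a ten-element list with ``max_samples=3`` would then yield a
   three-element view, while the underlying list stays untouched.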


.. py:class:: MultimodalUniverseDataset(config: dict, data_location: pathlib.Path | str | None = None)

   Bases: :py:obj:`hyrax.datasets.dataset_registry.HyraxDataset`


   Load a MultimodalUniverse dataset through Hugging Face ``datasets``.

   This dataset class is intentionally generic so one configuration pattern can
   be used for image, spectra, and time-series MMU datasets.

   .. rubric:: Examples

   Example ``data_request`` configuration::

       {
           "infer": {
               "mmu": {
                   "dataset_class": "MultimodalUniverseDataset",
                   "data_location": "hf://MultimodalUniverse/plasticc",
                   "primary_id_field": "object_id",
                   "dataset_config": {
                       "MultimodalUniverseDataset": {
                           "split": "train",
                           "max_samples": 32,
                       }
                   },
               }
           }
       }
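
   As a rough guide to what this configuration might correspond to, the sketch
   below loads the first few rows with Hugging Face ``datasets`` directly. The
   helper names ``repo_id_from_location`` and ``load_first_rows`` are
   hypothetical, and treating ``hf://`` as a prefix to strip is an assumption
   about how ``data_location`` is interpreted. Some repositories may need
   additional ``load_dataset`` arguments.

   .. code-block:: python

       from itertools import islice


       def repo_id_from_location(data_location: str) -> str:
           # Assumption: "hf://" marks a Hugging Face Hub repository ID.
           prefix = "hf://"
           if data_location.startswith(prefix):
               return data_location[len(prefix):]
           return data_location


       def load_first_rows(data_location: str, split: str, max_samples: int) -> list:
           # Imported lazily so the helper above works without `datasets` installed.
           from datasets import load_dataset

           stream = load_dataset(
               repo_id_from_location(data_location), split=split, streaming=True
           )
           # Stop after max_samples rows instead of downloading the whole split.
           return list(islice(stream, max_samples))

   With the values from the configuration above, the call would be
   ``load_first_rows("hf://MultimodalUniverse/plasticc", "train", 32)``.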

   .. py:method:: __init__

   Overall initialization for all datasets, which saves the config.

   Subclasses of ``HyraxDataset`` ought to call this at the end of their ``__init__``, like:

   .. code-block:: python

       from hyrax.datasets import HyraxDataset

       class MyDataset(HyraxDataset):
           def __init__(self, config):
               # <your code>
               super().__init__(config)

   If per-tensor metadata is available, dataset authors are encouraged to create an
   Astropy ``Table`` of that data, in the same order as their data, and pass it as
   ``metadata_table``, as shown below:

   .. code-block:: python

       from hyrax.datasets import HyraxDataset
       from astropy.table import Table

       class MyDataset(HyraxDataset):
           def __init__(self, config):
               # <your code>
               metadata_table = Table(<Your catalog data goes here>)
               super().__init__(config, metadata_table)

   :param config: The runtime configuration for hyrax
   :type config: dict, Optional
   :param metadata_table: An Astropy Table that (1) contains the metadata columns desired for
                          visualization and (2) lists its rows in the order your data will be enumerated.
   :type metadata_table: Optional[Table], optional
   :param object_id_column_name: The name of the column containing object IDs. If None, uses the default
                                 from config or creates one from the ids() method.
   :type object_id_column_name: Optional[str], optional


   .. py:attribute:: data_location
      :value: ''


   .. py:attribute:: split


   .. py:attribute:: max_samples


   .. py:attribute:: streaming


   .. py:attribute:: dataset


   .. py:attribute:: _column_name_map


   .. py:method:: _normalize_data_location(data_location: str) -> str


   .. py:method:: _load_dataset(dataset_source: str)


   .. py:method:: _limit_non_streaming_dataset(dataset: Any, max_samples: int)


   .. py:method:: _build_column_name_map() -> dict[str, str]

      Returns a map from sanitized column names to the original column names.

      A column name may contain punctuation or start with a number. In these
      cases column access is also allowed via a sanitized name, where all
      punctuation is replaced with the underscore character, and any field starting
      with a number is replaced by ``field_``.

      Every field is entered in the dictionary, whether or not it needed
      sanitization. When no sanitization is needed, the sanitized name is exactly
      the field name.


   .. py:method:: _sanitize_name(column_name: str) -> str

      Take a column name that may contain punctuation and return a version with
      underscores replacing the punctuation.
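
      The sanitization and mapping described for these two methods can be
      sketched as below. This is a minimal illustration, not the actual
      implementation: the function names are hypothetical, only ASCII punctuation
      is handled, and prepending ``field_`` to names that start with a digit is
      one assumed reading of the rule stated above.

      .. code-block:: python

          import string

          # Translation table mapping every ASCII punctuation character to "_".
          _PUNCTUATION_TO_UNDERSCORE = str.maketrans(
              string.punctuation, "_" * len(string.punctuation)
          )


          def sanitize_name(column_name: str) -> str:
              """Replace ASCII punctuation in ``column_name`` with underscores."""
              return column_name.translate(_PUNCTUATION_TO_UNDERSCORE)


          def build_column_name_map(column_names: list[str]) -> dict[str, str]:
              """Map each sanitized name back to its original column name."""
              name_map = {}
              for original in column_names:
                  sanitized = sanitize_name(original)
                  # Assumption: a leading digit is handled by prepending "field_".
                  if sanitized[:1].isdigit():
                      sanitized = "field_" + sanitized
                  name_map[sanitized] = original
              return name_map

      Under these assumptions, a column named ``flux-g`` would also be reachable
      as ``flux_g``, and a name that needs no sanitization maps to itself. If two
      columns sanitize to the same key, the later one wins in this sketch.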


   .. py:method:: _register_getters() -> None


   .. py:method:: __len__() -> int