hyrax.datasets
==============

.. py:module:: hyrax.datasets

.. autoapi-nested-parse::

   Hyrax has several built-in datasets that you can use for astronomical data. For many uses, these datasets
   can be configured out of the box for a given project.

   :doc:`FitsImageDataset <fits_image_dataset/index>` is a generic container for FITS image cutout data
   indexed by a user-provided catalog file. It attempts to cover common usage paradigms such as multiple images
   of the same object differentiated by telescope filter; however, extending the class as a custom dataset
   may be a better fit for advanced usage.

   :doc:`LSSTDataset <lsst_dataset/index>` is an alpha-quality container for LSST cutout images, currently
   limited to ``deep_coadd`` type images, and restricted to run only on a Rubin Observatory RSP environment
   where `LSST Pipeline <https://pipelines.lsst.io/>`_ tools and a
   `data butler <https://pipelines.lsst.io/modules/lsst.daf.butler/index.html>`_ with the appropriate images
   are available.

   :doc:`DownloadedLSSTDataset <downloaded_lsst_dataset/index>` is a subclass of LSSTDataset that generates
   cutouts from the butler and saves them as ``.pt`` files on first access. On subsequent access,
   it loads the cutouts directly from these files, which can significantly speed up data loading times.
   It inherits from LSSTDataset to access the data butler and catalog functionality.

   :doc:`HSCDataset <hsc_dataset/index>` works similarly to FitsImageDataset, but is specialized to
   `Hyper Suprime-Cam (HSC) <https://hsc-release.mtk.nao.ac.jp/doc/index.php/data/>`_ cutout images downloaded
   with the hyrax ``download`` verb. It contains additional integrity checks and is tightly integrated with
   the ``download`` and ``rebuild_manifest`` verbs. In the future this class and the downloader may become a
   separate package.

   :doc:`HyraxCifarDataset <hyrax_cifar_dataset/index>` gives access to the standard
   `CIFAR10 <https://www.cs.toronto.edu/~kriz/cifar.html>`_ labeled image dataset, automatically downloading the
   dataset if it is not present. This dataset is useful for testing hyrax and occasionally individual models,
   but it is not an astronomical dataset.

   :doc:`HyraxRandomDataset <random/hyrax_random_dataset/index>` is a utility dataset that
   generates random data with a specific shape.
   This dataset makes it easy to test new models with simple random data.
   It is highly configurable such that it's possible to simulate input data for models that
   are under development.

   Each of these datasets can be used as a starting point for a custom dataset by inheriting your custom dataset
   from e.g. ``FitsImageDataset``, or you can make an entirely custom dataset following the
   :doc:`dataset class reference </dataset_class_reference>` and/or
   :doc:`dataset class notebook example </pre_executed/external_dataset_class>`.

   The remaining classes in this module exist primarily for Hyrax interface purposes:

   :doc:`InferenceDataset <inference_dataset/index>` is a dataset class that represents an ``infer`` or ``umap``
   result, and may be returned from those verbs to provide data access.

   :doc:`HyraxDataset <dataset_registry/index>` is a base class for all datasets in Hyrax and must be within
   the inheritance hierarchy of all custom datasets. It is not usable on its own, but provides various fall-back
   functionality to make custom datasets easier to write. See the
   :doc:`dataset class reference </dataset_class_reference>` and
   :doc:`example notebook </pre_executed/external_dataset_class>` for more information.


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/hyrax/datasets/data_cache/index
   /autoapi/hyrax/datasets/data_provider/index
   /autoapi/hyrax/datasets/dataset_registry/index
   /autoapi/hyrax/datasets/downloaded_lsst_dataset/index
   /autoapi/hyrax/datasets/fits_image_dataset/index
   /autoapi/hyrax/datasets/hats_dataset/index
   /autoapi/hyrax/datasets/hsc_dataset/index
   /autoapi/hyrax/datasets/hyrax_cifar_dataset/index
   /autoapi/hyrax/datasets/hyrax_csv_dataset/index
   /autoapi/hyrax/datasets/inference_dataset/index
   /autoapi/hyrax/datasets/lancedb_dataset/index
   /autoapi/hyrax/datasets/lsst_dataset/index
   /autoapi/hyrax/datasets/mmu_dataset/index
   /autoapi/hyrax/datasets/nested_pandas_dataset/index
   /autoapi/hyrax/datasets/random/index
   /autoapi/hyrax/datasets/result_dataset/index
   /autoapi/hyrax/datasets/result_factories/index


Classes
-------

.. autoapisummary::

   hyrax.datasets.FitsImageDataset
   hyrax.datasets.LSSTDataset
   hyrax.datasets.DownloadedLSSTDataset
   hyrax.datasets.HSCDataset
   hyrax.datasets.HyraxCifarDataset
   hyrax.datasets.HyraxRandomDataset
   hyrax.datasets.HyraxRandomDatasetBase
   hyrax.datasets.InferenceDataset
   hyrax.datasets.ResultDataset
   hyrax.datasets.ResultDatasetWriter
   hyrax.datasets.HyraxDataset
   hyrax.datasets.HyraxCSVDataset
   hyrax.datasets.HyraxHATSDataset
   hyrax.datasets.MultimodalUniverseDataset
   hyrax.datasets.NestedPandasDataset
   hyrax.datasets.LanceDBDataset
   hyrax.datasets.DataCache


Functions
---------

.. autoapisummary::

   hyrax.datasets.create_results_writer
   hyrax.datasets.load_results_dataset


Package Contents
----------------

.. py:class:: FitsImageDataset(config: dict, data_location=None)

   Bases: :py:obj:`hyrax.datasets.dataset_registry.HyraxDataset`, :py:obj:`hyrax.datasets.dataset_registry.HyraxImageDataset`, :py:obj:`torch.utils.data.Dataset`


   Dataset for Fits Images, typically cutouts.

   .. py:method:: __init__

   Initialize a FitsImageDataset

   Most work is done in ``_init_from_path`` and functions it calls in order to allow
   subclasses to override behavior.

   :param config: Nested configuration dictionary for hyrax
   :type config: dict
   :param data_location: The directory location of the data that this dataset class will access
   :type data_location: Optional[Union[Path, str]]


   .. py:attribute:: _called_from_test
      :value: False


   .. py:attribute:: _config


   .. py:attribute:: data_location
      :value: None


   .. py:attribute:: object_id_column_name


   .. py:attribute:: filter_column_name


   .. py:attribute:: filename_column_name


   .. py:method:: _init_from_path(path: Union[pathlib.Path, str])

      __init__ helper. Initialize the dataset from a path. This involves several filesystem scan
      operations and will ultimately open and read the header info of every FITS file in the given directory.

      :param path: Path or string specifying the directory path that is the root of all filenames in the
                   catalog table
      :type path: Union[Path, str]


   .. py:method:: _set_crop_transform()

      Returns the crop transform on the image

      If overridden, the subclass must:

      1) Set ``self.cutout_shape`` to a tuple of ints representing the size of the cutouts that will be
         returned at some point in the init flow.

      2) Update the crop transform using ``self.set_crop_transform()`` from the HyraxImageDataset mixin.


   .. py:method:: _read_filter_catalog(filter_catalog_path: pathlib.Path | None)


   .. py:method:: _parse_filter_catalog(table) -> None

      Sets self.files by parsing the catalog.

      Subclasses may override this function to control parsing of the table more directly, but the
      overriding class must create the files dict which has type dict[object_id -> dict[filter -> filename]]
      with object_id, filter, and filename all strings.  In the case of no filter distinction, a single
      flag value may be used for the filter dict keys in the inner dicts.

      :param table: The catalog we read in
      :type table: Table
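
      As a rough illustration of the ``files`` structure described above, the following sketch builds
      the nested dict from plain tuples. The ``rows`` list and the ``build_files_dict`` helper are
      hypothetical stand-ins for a real catalog table and are not part of Hyrax.

      .. code-block:: python

         # Hypothetical catalog rows: (object_id, filter, filename), all strings.
         rows = [
             ("1001", "g", "1001_g.fits"),
             ("1001", "r", "1001_r.fits"),
             ("1002", "g", "1002_g.fits"),
         ]


         def build_files_dict(rows):
             """Build dict[object_id -> dict[filter -> filename]] from catalog rows."""
             files = {}
             for object_id, filter_name, filename in rows:
                 # Create the inner dict on first sight of an object_id.
                 files.setdefault(object_id, {})[filter_name] = filename
             return files


         files = build_files_dict(rows)

      An override of this method would assign a dict of this shape to ``self.files``.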


   .. py:method:: _before_preload() -> None


   .. py:method:: _prepare_metadata()


   .. py:method:: shape() -> tuple[int, int, int]

      Shape of the individual cutouts this will give to a model

      :returns: Tuple describing the dimensions of the 3 dimensional tensor handed back to models
                The first index is the number of filters
                The second index is the width of each image
                The third index is the height of each image
      :rtype: tuple[int,int,int]


   .. py:method:: __len__() -> int

      Returns number of objects in this loader

      :returns: number of objects in this data loader
      :rtype: int


   .. py:method:: get_object_id(idx: int) -> str

      Get the object ID at the given index

      :param idx: Index of the object ID to return
      :type idx: int

      :returns: The object ID at the given index
      :rtype: str


   .. py:method:: get_image(idx: int)

      Get the image at the given index as a PyTorch Tensor.

      :param idx: Index of the image to return
      :type idx: int

      :returns: The image at the given index as a PyTorch Tensor.
      :rtype: torch.Tensor


   .. py:method:: __getitem__(idx: int)


   .. py:method:: __contains__(object_id: str) -> bool

      Allows you to do `object_id in dataset` queries. Used by testing code.

      :param object_id: The object ID you'd like to know if is in the dataset
      :type object_id: str

      :returns: True if the object_id given is in the data set
      :rtype: bool


   .. py:method:: _get_file(index: int) -> pathlib.Path

      Private indexing method across all files.

      Returns the file path corresponding to the given index.

      The index is zero-based and defined in the same manner as the total order of _all_files() and
      _object_files() iterator. Useful if you have an np.array() or list built from _all_files() and you
      need to select an individual item.

      Only valid after self.object_ids, self.files, self.path, and self.num_filters have been initialized
      in __init__

      :param index: Index, see above for order semantics
      :type index: int

      :returns: The path to the file
      :rtype: Path


   .. py:method:: _all_ids(log_every=None) -> collections.abc.Generator[str]

      Private read-only iterator over all object_ids that enforces a strict total order across
      objects. Will not work prior to self.files initialization in __init__

      :Yields: *Iterator[str]* -- Object IDs currently in the dataset


   .. py:method:: _all_files()

      Private read-only iterator over all files that enforces a strict total order across
      objects and filters. Will not work prior to self.files, and self.path initialization in __init__

      :Yields: *Path* -- The path to the file.


   .. py:method:: _filter_filename(object_id)

      Private read-only iterator over all files for a given object. This enforces a strict total order
      across filters. Will not work prior to self.files initialization in __init__

      :Yields: *filter_name, file name* -- The name of a filter and the file name for the fits file.
               The file name is relative to self.path


   .. py:method:: _object_files(object_id)

      Private read-only iterator over all files for a given object. This enforces a strict total order
      across filters. Will not work prior to self.files, and self.path initialization in __init__

      :Yields: *Path* -- The path to the file.


   .. py:method:: _file_to_path(filename: str) -> pathlib.Path

      Turns a filename into a full path suitable for open. Equivalent to:

      `Path(self.path) / Path(filename)`

      :param filename: The filename string
      :type filename: str

      :returns: A full path that is openable.
      :rtype: Path


   .. py:method:: _read_object_id(object_id: str)


   .. py:method:: _apply_transforms(data: list[numpy.typing.ArrayLike])


   .. py:method:: _load_tensor_for_cache(object_id: str)

      Implementation of TensorCacheMixin abstract method.


   .. py:method:: _object_id_to_tensor(object_id: str)

      Converts an object_id to a pytorch tensor with dimensions (self.num_filters, self.cutout_shape[0],
      self.cutout_shape[1]). This is done by reading the file and slicing away any excess pixels at the
      far corners of the image from (0,0).

      The current implementation reads the files once the first time they are accessed, and then
      keeps them in a dict for future accesses.

      :param object_id: The object_id requested
      :type object_id: str

      :returns: A tensor with dimension (self.num_filters, self.cutout_shape[0], self.cutout_shape[1])
      :rtype: torch.Tensor
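
      The cropping step described above can be sketched without any array library. This is only an
      illustration under assumptions: nested lists stand in for image arrays and tensors, and
      ``crop_from_origin`` is a hypothetical helper, not a Hyrax function.

      .. code-block:: python

         def crop_from_origin(image, cutout_shape):
             """Keep the region starting at (0, 0), dropping excess rows and columns."""
             dim0, dim1 = cutout_shape
             return [row[:dim1] for row in image[:dim0]]


         # Two hypothetical single-filter images of slightly different sizes.
         image_g = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
         image_r = [[10, 20, 30, 40], [50, 60, 70, 80]]

         cutout_shape = (2, 2)
         # One cropped image per filter: (num_filters, cutout_shape[0], cutout_shape[1]).
         stacked = [crop_from_origin(img, cutout_shape) for img in (image_g, image_r)]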


.. py:class:: LSSTDataset(config, data_location=None)

   Bases: :py:obj:`hyrax.datasets.dataset_registry.HyraxDataset`, :py:obj:`hyrax.datasets.dataset_registry.HyraxImageDataset`, :py:obj:`torch.utils.data.Dataset`


   LSSTDataset: A dataset to access deep_coadd images from lsst pipelines
   via the butler. Must be run in an RSP.

   .. py:method:: __init__

   Initialize the dataset with either a HATS catalog or astropy table.

   Config can specify either:

   - config["data_set"]["hats_catalog"]: path to HATS catalog
   - config["data_set"]["astropy_table"]: path to any file readable by Astropy Table


   .. py:attribute:: BANDS
      :value: ['u', 'g', 'r', 'i', 'z', 'y']


   .. py:attribute:: object_id_autodetect_names
      :value: ['object_id', 'objectId']


   .. py:attribute:: catalog


   .. py:attribute:: sh_deg


   .. py:attribute:: sw_deg


   .. py:attribute:: oid_column_name


   .. py:method:: _butler_available()


   .. py:method:: _get_butler_thread_safe()

      Thread safe butler creation

      This function ensures that there is one and only one butler created per thread
      and that threads always use their assigned butler.

      This is necessary because child classes of this one use butlers, and butler
      objects are not safe for multithreaded access.

      :returns: The butler assigned to the current thread.
      :rtype: butler
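
      One common way to get a "one object per thread" guarantee in Python is ``threading.local``.
      The sketch below shows that general pattern only; whether Hyrax uses ``threading.local``
      internally is an assumption, and ``FakeButler`` is a hypothetical stand-in for a real butler.

      .. code-block:: python

         import threading

         _thread_local = threading.local()


         class FakeButler:
             """Hypothetical stand-in for a real butler object."""


         def get_butler_thread_safe():
             # Create the butler on first use in this thread, then reuse it.
             if not hasattr(_thread_local, "butler"):
                 _thread_local.butler = FakeButler()
             return _thread_local.butler


         results = {}


         def worker(name):
             # Two calls in the same thread should return the same object.
             results[name] = (get_butler_thread_safe(), get_butler_thread_safe())


         threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(2)]
         for t in threads:
             t.start()
         for t in threads:
             t.join()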


   .. py:method:: _detect_object_id_column_name()

      Setup file naming strategy based on catalog columns.


   .. py:method:: _load_catalog(data_set_config)

      Load the catalog from either a HATS catalog or an astropy table.


   .. py:method:: _load_hats_catalog(hats_path)

      Load catalog from HATS format using LSDB.


   .. py:method:: _load_astropy_catalog(table_path)

      Load catalog from astropy table format or pickled astropy table.


   .. py:method:: __len__()


   .. py:method:: get_image(idxs)

      Get image cutouts for the given indices.

      :param idxs: The index or indices of the cutouts to retrieve.
      :type idxs: int or list of int

      :returns: Single cutout tensor or list of cutout tensors.
      :rtype: list or torch.Tensor


   .. py:method:: __getitem__(idxs)

      Get default data fields for this dataset.

      :param idxs: The index or indices of the cutouts to retrieve.
      :type idxs: int or list of int

      :returns: A dictionary containing the default data fields.
      :rtype: dict


   .. py:method:: _parse_box(patch, row)

      Return a Box2I representing the desired cutout in pixel space, given a "row" of catalog data
      which includes the semi-height (sh) and semi-width (sw) in degrees desired for the cutout.


   .. py:method:: _parse_sphere_point(row)

      Return a SpherePoint with the ra and dec given in the "row" of catalog data.
      Row must include the RA and dec as "ra" and "dec" columns respectively


   .. py:method:: _get_tract_patch(row)

      Return (tractInfo, patchInfo) for a given row.

      This function only returns the single principal tract and patch in the case of overlap.


   .. py:method:: _request_patch(tract_index, patch_index)

      Request a patch from the butler. This will be a list of
      lsst.afw.image objects each corresponding to the configured
      bands

      Uses functools.lru_cache for basic in-memory caching.
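
      For readers unfamiliar with ``functools.lru_cache``, the following minimal sketch shows the
      caching behavior on a hypothetical ``request_patch`` function. The ``maxsize`` value and the
      returned string are illustrative assumptions; no butler is involved.

      .. code-block:: python

         from functools import lru_cache

         fetch_count = 0


         @lru_cache(maxsize=4)
         def request_patch(tract_index, patch_index):
             """Hypothetical stand-in for an expensive butler request."""
             global fetch_count
             fetch_count += 1
             return f"patch-{tract_index}-{patch_index}"


         request_patch(1, 2)  # Cache miss: runs the function body.
         request_patch(1, 2)  # Cache hit: returns the stored result.
         info = request_patch.cache_info()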


   .. py:method:: _fetch_single_cutout(row)

      Make a single cutout, returning a torch tensor.

      Does not handle edge-of-tract/patch type edge cases, will only work near
      center of a patch.


.. py:class:: DownloadedLSSTDataset(config, data_location)

   Bases: :py:obj:`hyrax.datasets.lsst_dataset.LSSTDataset`


   DownloadedLSSTDataset: A dataset that inherits from LSSTDataset and downloads
   cutouts from the LSST butler, saving them as `.pt` files during first access.
   On subsequent accesses, it loads cutouts directly from these cached files.

   This class also creates a manifest file with the shape of each cutout and the
   corresponding filename.

   Public Methods:
       download_cutouts(indices=None, sync_filesystem=True, max_workers=None, force_retry=False):
           Download cutouts with parallel processing. Automatically resumes from
           previous progress. Use max_workers to control thread count, force_retry
           to re-attempt failed downloads.

       manifest_stats():
           Returns dict with download statistics: total, successful, failed, pending
           counts and manifest file path.

       download_progress():
           Returns detailed progress metrics including completion percentage and
           failure rates.

       reset_failed_downloads():
           Resets all failed download attempts to allow retry without force_retry flag.
           Returns count of reset entries.

       save_manifest_now():
           Forces immediate manifest save (normally saved periodically during downloads).

       cache_info():
           Returns LRU cache statistics for patch fetching performance monitoring.

       clear_cache():
           Clears the patch LRU cache to free memory.

   Usage Example:
       # Initialize Hyrax
       import hyrax
       h = hyrax.Hyrax()
       a = h.prepare()

       # Download all cutouts (resumes automatically)
       # WARNING: The LRU caching scheme is slightly complicated, so it is recommended to
       # use the default max_workers=1 for the first download. Simply using more workers
       # may not always speed up the download process.
       a.download_cutouts(max_workers=4)

       # Check progress
       a.download_progress()

       # Retry failed downloads
       a.download_cutouts(force_retry=True)

       # Access cutouts (loads from cache)
       cutout = a[0]  # Single cutout
       cutouts = a[0:10]  # Multiple cutouts

   File Organization:

   - Cutouts saved as: cutout_{object_id}.pt or cutout_{index:04d}.pt
   - Manifest saved as: manifest.fits (Astropy) or manifest.parquet (HATS)
   - All files stored in the data_location provided during initialization
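
   The two cutout filename patterns above can be sketched as a small helper. The rule used here to
   choose between them (use the object ID whenever one is available) is an assumption for
   illustration, and ``cutout_filename`` is a hypothetical function, not part of this class.

   .. code-block:: python

      def cutout_filename(index, object_id=None):
          """Name a cutout by object ID when available, else by zero-padded index."""
          if object_id is not None:
              return f"cutout_{object_id}.pt"
          return f"cutout_{index:04d}.pt"


      by_index = cutout_filename(7)
      by_object_id = cutout_filename(7, object_id="12345")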

   .. py:method:: __init__

   Initialize the dataset with either a HATS catalog or astropy table.

   Config can specify either:

   - config["data_set"]["hats_catalog"]: path to HATS catalog
   - config["data_set"]["astropy_table"]: path to any file readable by Astropy Table


   .. py:attribute:: download_dir


   .. py:attribute:: catalog_object_ids


   .. py:attribute:: _manifest_lock


   .. py:attribute:: _updates_since_save
      :value: 0


   .. py:attribute:: _save_interval
      :value: 1000


   .. py:attribute:: _band_failure_stats


   .. py:attribute:: _band_failure_lock


   .. py:attribute:: _manifest_filter_object_ids
      :value: None


   .. py:attribute:: _catalog_to_manifest_index_map
      :value: None


   .. py:attribute:: _manifest_to_catalog_index_map
      :value: None


   .. py:method:: get_objectId(idx)

      Get object ID for a given index based on naming strategy.


   .. py:method:: _setup_naming_strategy()

      Setup file naming strategy based on catalog columns.


   .. py:method:: _initialize_manifest()

      Create new manifest or load/merge with existing manifest, with band filtering validation.

      The manifest is always an astropy Table with at least the following columns:

      - ``cutout_shape``: np.array of dimensions, e.g. [3,150,150]
      - ``filename``: string containing the fits filename containing the tensor for the object
      - ``downloaded_bands``: string containing a comma-separated list of the bands downloaded.
        Order is expected to be consistent between rows.

      When this astropy table is loaded into memory, multiple sources are consulted:

      - The manifest on the filesystem, which contains the source of truth for what
        files have been downloaded. If this is not found, it is created.
      - The bands given in the catalog passed in.


   .. py:method:: _load_existing_manifest()

      Load existing manifest file.


   .. py:method:: _update_manifest_from_catalog(existing_manifest)

      Using object_id as a unique key, adds manifest entries to existing_manifest,
      using self.catalog as the source of any new objects.

      self.catalog is not altered by this operation.

      Entries in existing_manifest are not altered by this operation.
      New entries are added to the end of existing_manifest with a state indicating
      they have not been downloaded.
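
      A minimal sketch of this merge, assuming lists of dicts in place of astropy Tables. The empty
      ``filename`` used to mark "not downloaded" is a hypothetical placeholder; the real manifest
      state encoding may differ.

      .. code-block:: python

         def update_manifest_from_catalog(existing_manifest, catalog):
             """Append one not-yet-downloaded row per catalog object missing from the manifest."""
             known_ids = {row["object_id"] for row in existing_manifest}
             for entry in catalog:
                 if entry["object_id"] not in known_ids:
                     # Hypothetical marker for "not downloaded yet".
                     existing_manifest.append({"object_id": entry["object_id"], "filename": ""})
                     known_ids.add(entry["object_id"])
             return existing_manifest


         manifest = [{"object_id": "a", "filename": "cutout_a.pt"}]
         catalog = [{"object_id": "a"}, {"object_id": "b"}]
         merged = update_manifest_from_catalog(manifest, catalog)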


   .. py:method:: _build_catalog_to_manifest_index_map()

      Build efficient mapping from catalog indices to manifest indices.


   .. py:method:: _add_manifest_columns_to_table(table)

      Add cutout_shape, filename, and downloaded_bands columns to manifest.


   .. py:method:: _longest_object_id_idx()


   .. py:method:: _get_available_bands_from_manifest(manifest)

      Get available bands by finding entries with complete band coverage.

      Uses cutout_shape[0] to determine the expected number of bands, then finds
      entries where downloaded_bands has that many entries (i.e., complete downloads).
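
      The completeness check can be sketched as follows, assuming manifest rows are plain dicts and
      that the first complete entry determines the band list. ``available_bands`` is a hypothetical
      helper written for illustration only.

      .. code-block:: python

         def available_bands(manifest):
             """Return the band list of the first entry whose download covers every band."""
             for row in manifest:
                 shape, bands = row["cutout_shape"], row["downloaded_bands"]
                 if shape is None or not bands:
                     continue  # Entry has not been downloaded.
                 band_list = bands.split(",")
                 # cutout_shape[0] is the expected number of bands.
                 if len(band_list) == shape[0]:
                     return band_list
             return []


         manifest = [
             {"cutout_shape": None, "downloaded_bands": ""},
             {"cutout_shape": (3, 150, 150), "downloaded_bands": "g,r"},
             {"cutout_shape": (3, 150, 150), "downloaded_bands": "g,r,i"},
         ]
         bands = available_bands(manifest)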


   .. py:method:: _setup_band_filtering(requested_bands, original_band_order)

      Setup band filtering to extract only requested bands from cached cutouts.


   .. py:method:: _get_cutout_path_from_idx(idx)

      Generate cutout file path for a given index.

      This simply applies a pattern to the filename using the object_id column.
      No guarantees are made about the file itself.


   .. py:method:: _get_cutout_path_from_manifest(idx)

      Get the cutout path by consulting the manifest

      The download thread ensures that the filename is not written to the manifest
      until all the bands that we intend to download are downloaded.

      This function is intended to be a thread safe way to get valid cutout paths.
      In the case where the file exists and is believed to be correctly downloaded
      you get a filename, but this will return None if there is some other issue.

      :param idx: The catalog index of the relevant cutout
      :type idx: int

      :returns: path to the cutout.
      :rtype: Path


   .. py:method:: _update_manifest_entry(idx, cutout_shape=None, filename='Attempted', downloaded_bands=None)

      Thread-safe manifest update with periodic saves.

      :param idx: Index in the manifest
      :param cutout_shape: Shape tuple of the cutout tensor, or None for failed downloads
      :param filename: Basename of the saved file, or "Attempted" only when ALL bands fail
      :param downloaded_bands: List of band names successfully downloaded in tensor order
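
      The "lock, count, save every N updates" pattern can be sketched in isolation. ``ManifestSketch``
      is a hypothetical, simplified class; it only mirrors the attribute names
      ``_updates_since_save`` and ``_save_interval`` listed above and counts saves instead of
      writing a file.

      .. code-block:: python

         import threading


         class ManifestSketch:
             """Hypothetical, simplified manifest holder; not the Hyrax class."""

             def __init__(self, num_rows, save_interval=1000):
                 self.rows = [None] * num_rows
                 self.save_count = 0
                 self._lock = threading.Lock()
                 self._updates_since_save = 0
                 self._save_interval = save_interval

             def update_entry(self, idx, filename="Attempted"):
                 with self._lock:  # Only one thread mutates the manifest at a time.
                     self.rows[idx] = filename
                     self._updates_since_save += 1
                     if self._updates_since_save >= self._save_interval:
                         self._save()
                         self._updates_since_save = 0

             def _save(self):
                 # A real implementation would write the table to disk here.
                 self.save_count += 1


         manifest = ManifestSketch(num_rows=5, save_interval=2)
         for i in range(5):
             manifest.update_entry(i, filename=f"cutout_{i:04d}.pt")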


   .. py:method:: _save_manifest()

      Save manifest


   .. py:method:: _sync_manifest_with_filesystem()

      Sync manifest with actual downloaded files on disk.

      This updates the manifest to reflect what is on the filesystem.
      For existing cutouts this loads every file using `torch.load`


   .. py:method:: _request_patch_cached(tract_index, patch_index, butler, skymap_name, bands_tuple)
      :staticmethod:


      Cached patch fetching using static method.

      Static method means no 'self' in cache key, making it truly global.
      Thread-safe because each call creates its own Butler instance.


   .. py:method:: _fetch_single_cutout(row, idx=None, manifest_idx=None)

      Fetch cutout, using saved cutout if available, with optional band filtering.


   .. py:method:: _fetch_cutout_with_cache(row)

      Generate cutout using cached patch fetching with NaN filling for failed bands.


   .. py:method:: __len__()

      Return length of current catalog, not the full manifest.


   .. py:method:: _get_manifest_index_for_catalog_index(catalog_idx)

      Map catalog index to manifest index. None return indicates no such item in manifest.


   .. py:method:: get_image(idxs)

      Fetch image cutout(s) for given index or indices, using caching and band filtering.

      :param idxs: Index or indices to fetch.
      :type idxs: int or slice or list

      :returns: Single cutout tensor or list of cutout tensors.
      :rtype: torch.Tensor or list of torch.Tensor


   .. py:method:: __getitem__(idxs) -> dict

      Modified to pass index for saving cutouts.

      :param idxs: Index or indices to fetch.
      :type idxs: int or slice or list

      :returns: Dictionary with key 'data' containing another dict of default data fields
                to return. Currently only 'image' is supported.
      :rtype: dict


   .. py:method:: download_cutouts(indices=None, sync_filesystem=True, max_workers=None, force_retry=False)

      Download cutouts using multiple threads with caching.

      :param indices: List of indices to download, or None for all
      :param sync_filesystem: Whether to sync manifest with existing files on disk
      :param max_workers: Maximum number of worker threads, or None to use default
      :param force_retry: Whether to retry previously failed downloads


   .. py:method:: _download_single_cutout(catalog_idx, manifest_idx)

      Helper method to download a single cutout.


   .. py:method:: cache_info()

      Get cache statistics.


   .. py:method:: clear_cache()

      Clear the LRU cache.


   .. py:method:: manifest_stats()

      Get manifest statistics including downloaded bands information.


   .. py:method:: band_filtering_info()

      Get information about current band filtering configuration.


   .. py:method:: save_manifest_now()

      Force immediate manifest save.


   .. py:method:: _determine_numprocs_download()
      :staticmethod:


      Determine number of threads for downloading.


   .. py:method:: reset_failed_downloads()

      Reset failed download attempts to allow retry.


   .. py:method:: download_progress()

      Get detailed download progress information.


   .. py:method:: download_summary()

      Get detailed download and band analysis, accounting for band filtering.


.. py:class:: HSCDataset(config: dict, data_location=None)

   Bases: :py:obj:`hyrax.datasets.fits_image_dataset.FitsImageDataset`


   Dataset for sets of HSC cutouts created by the ``hyrax download`` command.

   .. py:method:: __init__


   .. py:attribute:: _called_from_test
      :value: False


   .. py:attribute:: filters_config


   .. py:method:: _read_filter_catalog(filter_catalog_path: pathlib.Path | None)


   .. py:method:: _parse_filter_catalog(table) -> None

      Sets self.files by parsing the catalog.

      Subclasses may override this function to control parsing of the table more directly, but the
      overriding class must create the files dict which has type dict[object_id -> dict[filter -> filename]]
      with object_id, filter, and filename all strings.  In the case of no filter distinction, a single
      flag value may be used for the filter dict keys in the inner dicts.

      :param table: The catalog we read in
      :type table: Table


   .. py:method:: _set_crop_transform()

      Returns the crop transform on the image

      If overridden, the subclass must:

      1) Set ``self.cutout_shape`` to a tuple of ints representing the size of the cutouts that will be
         returned at some point in the init flow.

      2) Update the crop transform using ``self.set_crop_transform()`` from the HyraxImageDataset mixin.


   .. py:method:: _before_preload()


   .. py:method:: _scan_file_names(filters: list[str] | None, filter_obj_ids: list[str] | None = None) -> hyrax.datasets.fits_image_dataset.files_dict

      Class initialization helper

      :param filters: List of filters that we should look for in the data corpus
      :type filters: list[str], Optional:
      :param filter_obj_ids: Filter the file scan to only file names which have the provided object IDs, skipping other files
                             When not provided, all file names in the configured data directory that match the pattern from
                             hyrax download are parsed.
      :type filter_obj_ids: list[str], Optional:

      :returns: Nested dictionary where the first level maps object_id -> dict, and the second level maps
                filter_name -> file name. Corresponds to self.files
      :rtype: dict[str,dict[str,str]]


   .. py:method:: _determine_numprocs() -> int
      :staticmethod:


   .. py:method:: _fixup_limit(nproc: int, res, est_limit, est_procs) -> int
      :staticmethod:


   .. py:method:: _scan_file_dimensions() -> dim_dict


   .. py:method:: _scan_file_dimension(processing_unit: tuple[str, list[str]]) -> tuple[str, list[tuple[int, int]]]
      :staticmethod:


   .. py:method:: _fits_file_dims(filepath) -> tuple[int, int]
      :staticmethod:


   .. py:method:: _prune_objects(filters_ref: list[str], cutout_shape: tuple[int, int] | None)

      Class initialization helper. Prunes objects from the list of objects.

      1) Removes any objects which do not have all the filters specified in filters_ref
      2) If a cutout_shape was provided in the constructor, prunes files that are too small
         for the chosen cutout size

      This function deletes from self.files and self.dims via _prune_object

      :param files: Nested dictionary where the first level maps object_id -> dict, and the second level maps
                    filter_name -> file name. This is created by _scan_files()
      :type files: dict[str,dict[str,str]]
      :param filters_ref: List of the filter names
      :type filters_ref: list[str]
      :param cutout_shape: Cutout shape tuple provided from constructor
      :type cutout_shape: tuple[int, int]
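
      The two pruning rules can be sketched on plain dicts. This is an illustration under
      assumptions: ``prune_reasons`` is a hypothetical helper that only reports which objects would
      be pruned and why, and ``dims`` is assumed to map each object ID to a list of per-file
      dimension tuples.

      .. code-block:: python

         def prune_reasons(files, dims, filters_ref, cutout_shape=None):
             """Return {object_id: reason} for objects that fail either pruning rule."""
             reasons = {}
             for object_id, by_filter in files.items():
                 # Rule 1: the object must have every filter in filters_ref.
                 if not set(filters_ref) <= set(by_filter):
                     reasons[object_id] = "missing filters"
                 # Rule 2: every file must be at least as large as the requested cutout.
                 elif cutout_shape is not None and any(
                     d0 < cutout_shape[0] or d1 < cutout_shape[1] for d0, d1 in dims[object_id]
                 ):
                     reasons[object_id] = "too small for cutout"
             return reasons


         files = {
             "a": {"g": "a_g.fits", "r": "a_r.fits"},
             "b": {"g": "b_g.fits"},
             "c": {"g": "c_g.fits", "r": "c_r.fits"},
         }
         dims = {"a": [(100, 100), (100, 100)], "b": [(100, 100)], "c": [(100, 40), (100, 100)]}
         pruned = prune_reasons(files, dims, ["g", "r"], cutout_shape=(64, 64))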


   .. py:method:: _mark_for_prune(object_id, reason)


   .. py:method:: _prune_object(object_id, reason: str)


   .. py:method:: _check_file_dimensions() -> tuple[int, int]

      Class initialization helper. Find the maximal pixel size that all images can support

      It is assumed that all the cutouts will be of very similar size; however, HSC's cutout
      server does not return exactly the same number of pixels for every query, even when it
      is given the same angular spread for every cutout.

      Machine learning models expect all images to be the same size.

      This function warns on significant differences (>2px) on any dimension between the largest
      and smallest images.

      :returns: The minimum width and height in pixels of the entire dataset. In other words: the maximal image
                size in pixels that can be generated from ALL cutout images via cropping.
      :rtype: tuple(int,int)
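
      A minimal sketch of the size check, assuming ``dims`` maps each object ID to a list of
      per-file dimension tuples. ``check_file_dimensions`` here is a hypothetical free function, and
      the warning text is illustrative.

      .. code-block:: python

         import warnings


         def check_file_dimensions(dims):
             """Return the largest (dim0, dim1) size that every image can be cropped to."""
             all_dims = [shape for shapes in dims.values() for shape in shapes]
             firsts = [d[0] for d in all_dims]
             seconds = [d[1] for d in all_dims]
             # Warn when the largest and smallest images differ by more than 2px.
             if max(firsts) - min(firsts) > 2 or max(seconds) - min(seconds) > 2:
                 warnings.warn("Cutout sizes differ by more than 2px across the dataset")
             return min(firsts), min(seconds)


         dims = {"a": [(100, 101)], "b": [(99, 100), (100, 100)]}
         cutout_shape = check_file_dimensions(dims)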


   .. py:method:: _rebuild_manifest(config)


   .. py:method:: __contains__(object_id: str) -> bool

      Allows you to do `object_id in dataset` queries. Used by testing code.

      :param object_id: The object ID you'd like to know if is in the dataset
      :type object_id: str

      :returns: True if the object_id given is in the data set
      :rtype: bool


   .. py:method:: _all_files_full()

      Private read-only iterator over all files that enforces a strict total order across
      objects and filters. Will not work prior to self.files, and self.path initialization in __init__

      :Yields: *Tuple[object_id, filter, filename, dim]* -- Members of this tuple are
               - The object_id as a string
               - The filter name as a string
               - The filename relative to self.path
               - A tuple containing the dimensions of the fits file in pixels.


   .. py:method:: _object_files(object_id)

      Private read-only iterator over all files for a given object. This enforces a strict total order
      across filters. Will not work before ``self.files`` and ``self.path`` are initialized in ``__init__``.

      Guaranteed to only return files that have filters in self.filters_ref.

      :Yields: *Path* -- The path to the file.


   .. py:method:: display(index)


.. py:class:: HyraxCifarDataset(config: dict, data_location: pathlib.Path = None)

   Bases: :py:obj:`hyrax.datasets.dataset_registry.HyraxDataset`


   Map style CIFAR 10 dataset for Hyrax

   This utilizes the CIFAR dataset from torchvision for retrieving the dataset.

   .. py:method:: __init__


   .. py:attribute:: data_location
      :value: None


   .. py:attribute:: training_data


   .. py:attribute:: cifar


   .. py:attribute:: id_width
      :value: 0


   .. py:method:: get_image(idx)

      Get the image at the given index as a NumPy array.


   .. py:method:: get_label(idx)

      Get the label at the given index.


   .. py:method:: get_object_id(idx)

      Get the object ID for the item as a string.


   .. py:method:: __len__()


.. py:class:: HyraxRandomDataset(config, data_location)

   Bases: :py:obj:`HyraxRandomDatasetBase`, :py:obj:`hyrax.datasets.dataset_registry.HyraxDataset`, :py:obj:`torch.utils.data.Dataset`


   This dataset is a stand-in for a map-style dataset.
   It will produce random numpy arrays along with sequential numeric ids and,
   optionally, labels randomly selected from the provided list of possible labels.

   .. py:method:: __init__(config, data_location)

   Initialize the dataset using the parameters defined in the configuration.

   The ``data_location`` parameter is included for API consistency with other dataset classes, though
   it is not used by this implementation. All parameters are controlled by the following
   keys under the ``["data_set"]["HyraxRandomDataset"]`` table in the configuration:

   - ``size``: The number of random data samples to produce.
   - ``shape``: The shape of each random data sample as a tuple (e.g. (3, 29, 29) = 3
     layers of 2D data, each layer is 29x29 elements).
   - ``seed``: The random seed to use for reproducibility.
   - ``provided_labels``: A list of possible labels to randomly select from.
     If this is provided, the dataset will randomly select a label for each data sample.
   - ``metadata_fields``: A list of metadata field names. Used to create a metadata
     table with columns corresponding to each field name. All data is numeric.
   - ``number_invalid_values``: The number of invalid values to insert into the data.
   - ``invalid_value_type``: The type of invalid value to insert into the data.
     Valid values are "nan", "inf", "-inf", "none", or a float value.
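
   For orientation, the keys listed above could be laid out as in the following fragment,
   written here as a plain Python dictionary. Every value is illustrative only and should
   not be read as a Hyrax default.

   .. code-block:: python

       # Illustrative values only; these are not the Hyrax defaults.
       config = {
           "data_set": {
               "HyraxRandomDataset": {
                   "size": 100,
                   "shape": (3, 29, 29),
                   "seed": 42,
                   "provided_labels": ["cat", "dog"],
                   "metadata_fields": ["ra", "dec"],
                   "number_invalid_values": 0,
                   "invalid_value_type": "nan",
               }
           }
       }

       random_dataset_config = config["data_set"]["HyraxRandomDataset"]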


   .. py:method:: __getitem__(idx: int) -> dict

      Get a data sample by index.

      The returned dictionary will contain the following keys:

      - ``index``: The index of the data sample.
      - ``object_id``: The ID of the data sample.
      - ``image``: The data sample as a numpy array.
      - ``label``: The label of the data sample (if provided).


      :param idx: The index of the data sample to retrieve.
      :type idx: int

      :returns: A dictionary containing the data sample and its metadata.
      :rtype: dict


   .. py:method:: __len__()

      Get the total number of samples in this dataset. This should return
      the same value as the `size` parameter in the configuration.


.. py:class:: HyraxRandomDatasetBase(config, data_location)

   This is the base class for the random datasets provided by Hyrax.

   .. warning::

       Direct use of ``HyraxRandomDatasetBase`` is not advised. When working
       with Hyrax, prefer to use ``HyraxRandomDataset``.

   .. py:method:: __init__(config, data_location)

   Initialize the dataset using the parameters defined in the configuration.

   The ``data_location`` parameter is included for API consistency with other dataset classes, though
   it is not used by this implementation. All parameters are controlled by the following
   keys under the ``["data_set"]["HyraxRandomDataset"]`` table in the configuration:

   - ``size``: The number of random data samples to produce.
   - ``shape``: The shape of each random data sample as a tuple (e.g. (3, 29, 29) = 3
     layers of 2D data, each layer is 29x29 elements).
   - ``seed``: The random seed to use for reproducibility.
   - ``provided_labels``: A list of possible labels to randomly select from.
     If this is provided, the dataset will randomly select a label for each data sample.
   - ``metadata_fields``: A list of metadata field names. Used to create a metadata
     table with columns corresponding to each field name. All data is numeric.
   - ``number_invalid_values``: The number of invalid values to insert into the data.
   - ``invalid_value_type``: The type of invalid value to insert into the data.
     Valid values are "nan", "inf", "-inf", "none", or a float value.


   .. py:attribute:: data
      :type:  numpy.ndarray

      The random data samples produced by the dataset.


   .. py:attribute:: id_list
      :type:  list

      A list of sequential numeric IDs for each data sample.


   .. py:attribute:: provided_labels
      :type:  list

      A list of labels randomly selected from the provided list of possible labels.


   .. py:attribute:: data_location


   .. py:method:: get_image(idx: int) -> numpy.ndarray

      Get the image at the given index as a NumPy array.


   .. py:method:: get_label(idx: int) -> str

      Get the label at the given index.


   .. py:method:: get_object_id(idx: int) -> str

      Get the index of the item.


.. py:class:: InferenceDataset(config, results_dir: Union[pathlib.Path, str] | None = None, verb: str | None = None)

   Bases: :py:obj:`hyrax.datasets.dataset_registry.HyraxDataset`, :py:obj:`torch.utils.data.Dataset`


   This dataset class represents situations where we wish to treat the output of inference
   as a dataset, e.g. when performing umap/visualization operations.

   Initialize an InferenceDataset object.

   As a user of this code, you should almost never create this class directly. Instances of this class are
   returned by the umap and infer verbs. Prefer those over creating your own.

   If you do end up creating your own instance, you will need a hyrax config, and to know some things
   about where the result you are interested in is stored.

   :param config: The hyrax config dictionary
   :type config: dict
   :param results_dir: The results subdirectory of the inference or umap results you want to access, by default None.
                       If no results subdirectory is provided, this function will attempt the following in order:

                       #. Use the directory specified in ``config['results']['inference_dir']`` if set and the directory
                          exists
                       #. Look in the results directory configured in ``config['general']['results_dir']`` (``./results/``
                          by default), then use the most recent results directory corresponding to the verb specified.
   :type results_dir: Optional[Union[Path, str]], optional
   :param verb: The name of the verb that generated the results, only important when the most recent results
                are being fetched. If no verb is provided, "infer" will be assumed.
   :type verb: Optional[str], optional

   :raises RuntimeError: When the provided results directory is corrupt, or cannot be found.
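
   The fallback order described for ``results_dir`` can be sketched roughly as below. This is
   not the Hyrax implementation: ``resolve_results_dir`` is a hypothetical helper, and the
   example assumes that run directories contain the verb in their name and that "most recent"
   is judged by modification time.

   .. code-block:: python

       from pathlib import Path


       def resolve_results_dir(config, results_dir=None, verb=None):
           """Pick a results directory following the documented fallback order."""
           if results_dir is not None:
               explicit = Path(results_dir)
               if not explicit.is_dir():
                   raise RuntimeError(f"Results directory {explicit} was not found")
               return explicit

           # 1. Use the configured inference directory if it is set and exists.
           configured = config.get("results", {}).get("inference_dir")
           if configured and Path(configured).is_dir():
               return Path(configured)

           # 2. Fall back to the newest directory for the verb ("infer" by default).
           # ASSUMPTION: run directories contain the verb in their name, and
           # "most recent" is judged by modification time.
           verb = verb or "infer"
           root = Path(config.get("general", {}).get("results_dir", "./results/"))
           candidates = []
           if root.is_dir():
               candidates = [p for p in root.glob(f"*{verb}*") if p.is_dir()]
           if not candidates:
               raise RuntimeError(f"No {verb} results were found under {root}")
           return max(candidates, key=lambda p: p.stat().st_mtime)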


   .. py:attribute:: results_dir


   .. py:attribute:: batch_index


   .. py:attribute:: length


   .. py:attribute:: cached_batch_num
      :type:  int | None
      :value: None


   .. py:attribute:: shape_element


   .. py:attribute:: _original_dataset_config


   .. py:attribute:: original_dataset


   .. py:method:: _shape()

      The shape of the dataset (Discovered from files)

      :returns: Tuple with the shape of an individual element of the dataset
      :rtype: Tuple


   .. py:method:: get_object_id(idx) -> str

      Returns the ID at a particular index.

      IDs are provided by the primary dataset's primary ID column.


   .. py:method:: ids() -> list[str]

      Returns the IDs of the dataset.

      IDs flow from the primary dataset and the primary ID column.

      For an InferenceDataset instance, ``self.ids()`` is canonically the same as
      ``[self.get_object_id(i) for i in range(len(self))]``.


   .. py:method:: _ids() -> collections.abc.Generator[str]

      IDs of this dataset. Will return a string generator with IDs.

      These IDs are the IDs of the dataset used originally to generate this dataset.

      :returns: Generator that yields the string ids of this dataset
      :rtype: Generator[str]

      :Yields: *Generator[str]* -- Yields the string ids of this dataset


   .. py:method:: __getitem__(idx: Union[int, numpy.ndarray])

      Implements the ``[]`` operator

      :param idx: Either an index or a numpy array of indexes.
                  These are NOT the ID values of the dataset, but rather a zero-based index starting
                  at the beginning of the inference dataset.
      :type idx: Union[int, np.ndarray]

      :returns: Either the tensor corresponding to a single result, or a tensor with a multiplicity of
                results if multiple indexes were passed.
      :rtype: torch.Tensor


   .. py:method:: __len__() -> int

      Returns the length of the dataset.

      :returns: Length of the dataset.
      :rtype: int


   .. py:property:: original_config
      :type: dict


      Get the original configuration for the dataset used to generate this inference dataset

      Since this sort of dataset is definitionally an intermediate product, this returns the
      runtime config used to construct that dataset rather than this one.

      :returns: Configuration that can be used to create the original dataset that was used
                as input for whatever inference process created this dataset.
      :rtype: dict


   .. py:method:: metadata_fields() -> list[str]

      Get the metadata fields associated with the original dataset used to generate this one

      :returns: List of valid field names for metadata queries
      :rtype: list[str]


   .. py:method:: metadata(idxs: numpy.typing.ArrayLike, fields: list[str]) -> numpy.typing.ArrayLike

      Get metadata associated with the data in the InferenceDataset. This metadata comes from
      the original dataset, but is indexed according to the InferenceDataset.

      :param idxs: Indexes in the InferenceDataset for which metadata is desired
      :type idxs: npt.ArrayLike
      :param fields: Metadata fields requested
      :type fields: list[str]

      :returns: An array where the rows correspond to the passed list of indexes and the columns
                correspond to the fields passed. Order is preserved: ``metadata[i]`` corresponds to ``idxs[i]``.
      :rtype: npt.ArrayLike


   .. py:method:: _load_from_batch_file(batch_num: int, ids=Union[int, np.ndarray]) -> numpy.ndarray

      Hands back an array of tensors given a set of IDs in a particular batch and the given
      batch number


.. py:class:: ResultDataset(config: dict, data_location: Union[pathlib.Path, str])

   Bases: :py:obj:`hyrax.datasets.dataset_registry.HyraxDataset`


   Reader for Lance-based inference results.

   Provides HyraxQL-compatible getters to results stored in Lance format.

   Initialize the dataset.

   :param config: Hyrax configuration dictionary
   :type config: dict
   :param data_location: Path to results directory containing lance_db/
   :type data_location: Union[Path, str]


   .. py:attribute:: data_location


   .. py:attribute:: lance_dir


   .. py:attribute:: db


   .. py:attribute:: table


   .. py:attribute:: lance_dataset


   .. py:attribute:: tensor_shape


   .. py:attribute:: tensor_dtype


   .. py:method:: __len__() -> int

      Return the number of records in the dataset.


   .. py:method:: __getitem__(idx: Union[int, numpy.ndarray])

      Get data by index.

      :param idx: Single index or array of indices
      :type idx: Union[int, np.ndarray]

      :returns: Data tensor(s)
      :rtype: np.ndarray

      :raises IndexError: If index is out of range


   .. py:method:: __get_all__()

      Get all data tensors in the dataset.

      This is a specialized method that is meant for internal use (e.g. visualize_v2).
      It retrieves all tensors efficiently by assuming column names and accessing
      the array buffer directly, without creating Python objects for each row.

      :returns: All data tensors
      :rtype: np.ndarray


   .. py:method:: get_data(idx: int)

      Get data tensor at index (HyraxQL getter).

      :param idx: Index of the data item
      :type idx: int

      :returns: Data tensor
      :rtype: np.ndarray


   .. py:method:: get_object_id(idx: int) -> str

      Get object ID at index (HyraxQL getter).

      :param idx: Index of the data item
      :type idx: int

      :returns: Object ID
      :rtype: str


   .. py:method:: ids() -> list[str]

      Generate all object IDs.

      :returns: Object IDs in order
      :rtype: list[str]


.. py:class:: ResultDatasetWriter(result_dir: Union[str, pathlib.Path])

   Writer for Lance-based inference results.

   Writes inference results incrementally to Lance format using table.add()
   for each batch, avoiding memory accumulation.

   Initialize the writer.

   :param result_dir: Directory where Lance database will be created
   :type result_dir: Union[str, Path]
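
   The incremental write pattern can be sketched as a loop that hands one batch at a time to
   the writer and finalizes once at the end. The helper ``write_in_batches`` and the
   ``RecordingWriter`` stand-in are hypothetical and exist only so the example runs on its
   own; in real use the writer would be obtained from ``create_results_writer(result_dir)``.

   .. code-block:: python

       def write_in_batches(writer, object_ids, tensors, batch_size=2):
           """Feed results to a writer one batch at a time, then finalize.

           ``writer`` is any object with ``write_batch(object_ids, data)`` and
           ``commit()`` methods, matching the signatures documented here.
           """
           for start in range(0, len(object_ids), batch_size):
               stop = start + batch_size
               writer.write_batch(object_ids[start:stop], tensors[start:stop])
           writer.commit()


       class RecordingWriter:
           """Hypothetical stand-in used only to make this sketch runnable."""

           def __init__(self):
               self.batches = []
               self.committed = False

           def write_batch(self, object_ids, data):
               self.batches.append((list(object_ids), list(data)))

           def commit(self):
               self.committed = True


       writer = RecordingWriter()
       write_in_batches(writer, ["a", "b", "c"], [[0.1], [0.2], [0.3]])

   Because each batch is handed off as soon as it is sliced, the caller never needs to hold
   more than one batch of results beyond the inputs it was given.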


   .. py:attribute:: result_dir


   .. py:attribute:: lance_dir


   .. py:attribute:: db
      :value: None


   .. py:attribute:: table
      :value: None


   .. py:attribute:: schema
      :value: None


   .. py:attribute:: tensor_dtype
      :value: None


   .. py:attribute:: tensor_shape
      :value: None


   .. py:attribute:: batch_count
      :value: 0


   .. py:method:: write_batch(object_ids: numpy.ndarray, data: list[numpy.ndarray])

      Write a batch of results incrementally.

      :param object_ids: Array of object IDs (will be converted to strings)
      :type object_ids: np.ndarray
      :param data: List of numpy arrays (tensors) to write
      :type data: list[np.ndarray]


   .. py:method:: commit()

      Finalize the write by optimizing the table.


   .. py:method:: _create_schema(sample_tensor: numpy.ndarray)

      Create PyArrow schema with tensor metadata.

      :param sample_tensor: Sample tensor to determine dtype and shape
      :type sample_tensor: np.ndarray


.. py:function:: create_results_writer(result_dir: Union[str, pathlib.Path])

   Create a writer for results (Lance format).

   This factory creates a ResultDatasetWriter for writing inference results
   to Lance format. New writes always use Lance format going forward.

   :param result_dir: Directory where results should be saved
   :type result_dir: Union[str, Path]

   :returns: Writer instance for Lance storage
   :rtype: ResultDatasetWriter


.. py:function:: load_results_dataset(config: dict, results_dir: Union[pathlib.Path, str, None] = None, verb: Union[str, None] = None)

   Load a results dataset, auto-detecting format.

   This factory auto-detects whether the results are in Lance or .npy format
   and returns the appropriate dataset class.

   :param config: The hyrax config dictionary
   :type config: dict
   :param results_dir: The results subdirectory to load from
   :type results_dir: Union[Path, str, None], optional
   :param verb: The name of the verb that generated the results (for auto-discovery)
   :type verb: Union[str, None], optional

   :returns: The appropriate dataset instance based on detected format
   :rtype: Union[ResultDataset, InferenceDataset]
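
   A minimal sketch of format auto-detection is shown below. It assumes that Lance results
   are recognizable by a ``lance_db`` subdirectory (as suggested by the ``ResultDataset``
   ``data_location`` description) and that anything else uses the .npy layout. The helper
   ``detect_results_format`` is hypothetical, and the real detection logic may differ.

   .. code-block:: python

       from pathlib import Path


       def detect_results_format(results_dir):
           """Return "lance" or "npy" for a results directory.

           ASSUMPTION: Lance results are recognized by a ``lance_db`` subdirectory,
           and anything else is treated as the .npy layout.
           """
           results_dir = Path(results_dir)
           if (results_dir / "lance_db").is_dir():
               return "lance"
           return "npy"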


.. py:class:: HyraxDataset(config: dict, metadata_table=None, object_id_column_name=None)

   How to make a hyrax dataset:

   .. code-block:: python

       from hyrax.datasets import HyraxDataset

       class MyDataset(HyraxDataset):
           def __init__(self, config: dict):
               super().__init__(config)

           def __len__(self):
               # Your len function goes here
               pass

   Optional interfaces:

   ``metadata`` -> Subclasses may pass an astropy table of metadata to ``__init__`` in the
   superclass. This table of metadata will be available through the ``metadata_fields`` and
   ``metadata`` functions.  If desired, a subclass may override these functions directly
   rather than using the astropy Table interface.

   Further documentation is in the :doc:`/pre_executed/external_dataset_class` example notebook.


   .. py:method:: __init__

   Overall initialization for all Datasets which saves the config

   Subclasses of HyraxDataset ought to call this at the end of their __init__ like:

   .. code-block:: python

       from hyrax.datasets import HyraxDataset

       class MyDataset(HyraxDataset):
           def __init__(self, config):
               # <your code>
               super().__init__(config)

   If per-tensor metadata is available, it is recommended that dataset authors create an
   astropy Table of that data, in the same order as their data, and pass that ``metadata_table``
   as shown below:

   .. code-block:: python

       from hyrax.datasets import HyraxDataset
       from astropy.table import Table

       class MyDataset(HyraxDataset):
           def __init__(self, config):
               # <your code>
               metadata_table = Table(...)  # replace ... with your catalog data
               super().__init__(config, metadata_table)

   :param config: The runtime configuration for hyrax
   :type config: dict, Optional
   :param metadata_table: An Astropy Table with
                          1. the metadata columns desired for visualization AND
                          2. in the order your data will be enumerated.
   :type metadata_table: Optional[Table], optional
   :param object_id_column_name: The name of the column containing object IDs. If None, uses the default
                                 from config or creates one from the ids() method.
   :type object_id_column_name: Optional[str], optional


   .. py:attribute:: _config


   .. py:attribute:: _metadata_table
      :value: None


   .. py:property:: config


   .. py:method:: __init_subclass__()
      :classmethod:


   .. py:method:: metadata_fields() -> list[str]

      Returns a list of metadata fields supported by this object

      :returns: The column names of the metadata table passed. Empty string if no metadata was provided
                during construction of the HyraxDataset (or derived class).
      :rtype: list[str]


   .. py:method:: metadata(idxs: numpy.typing.ArrayLike, fields: list[str]) -> numpy.typing.ArrayLike

      Returns a table representing the metadata given an array of indexes and a list of fields.

      :param idxs: The indexes of the relevant tensor objects
      :type idxs: npt.ArrayLike
      :param fields: The names of the fields you would like returned. All values must be among those returned by
                     metadata_fields()
      :type fields: list[str]

      :returns: A numpy record array of your metadata, with only the columns specified.
                Roughly equivalent to: `metadata_table[idxs][fields].as_array()` where metadata_table is the
                astropy table that the HyraxDataset (or derived class) was constructed with.
      :rtype: npt.ArrayLike

      :raises RuntimeError: When none of the provided fields are


.. py:class:: HyraxCSVDataset(config: dict, data_location: pathlib.Path = None)

   Bases: :py:obj:`hyrax.datasets.dataset_registry.HyraxDataset`


   A Hyrax Dataset for CSV files.

   This class reads a CSV file using pandas with memory mapping enabled.
   It dynamically creates getter methods for each column in the CSV file,
   allowing users to request data from specific columns.

   .. note::

      Column names found in the CSV file are used to create the getter methods.
      If a column name contains characters that are invalid for method names,
      those characters are replaced with underscores.
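
   The getter-per-column idea in the note above can be sketched with ``setattr`` as follows.
   The class ``ColumnGetters``, the ``get_<column>`` naming, and the use of the ``\W`` regular
   expression for "invalid characters" are assumptions made for this example, not a
   description of how ``HyraxCSVDataset`` is implemented.

   .. code-block:: python

       import re


       class ColumnGetters:
           """Hypothetical stand-in that exposes one getter per column."""

           def __init__(self, columns):
               # ``columns`` maps a column name to a list of values.
               self._columns = columns
               for name in columns:
                   # Replace characters that cannot appear in a method name.
                   safe_name = re.sub(r"\W", "_", name)
                   setattr(self, f"get_{safe_name}", self._make_getter(name))

           def _make_getter(self, name):
               # Bind ``name`` now so each getter reads its own column.
               def getter(idx):
                   return self._columns[name][idx]

               return getter


       table = ColumnGetters({"object id": ["a1", "b2"], "flux-g": [1.5, 2.5]})

   With this stand-in, the columns ``"object id"`` and ``"flux-g"`` become reachable as
   ``table.get_object_id(idx)`` and ``table.get_flux_g(idx)``.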

   .. rubric:: Examples

   Example data_request configuration::

       {
           "train": {
               "data": {
                   "dataset_class": "HyraxCSVDataset",
                   "data_location": "</path/to/data.csv>",
                   "fields": ["<column1>", "<column2>", ...],
                   "primary_id_field": "<column name that contains a unique ID>",
               },
           },
           "validate": { "<similar to above>" },
           "infer": { "<similar to above>" },
       }

   .. py:method:: __init__


   .. py:attribute:: data_location
      :value: None


   .. py:attribute:: column_names


   .. py:attribute:: mem_mapped_csv
      :value: None


   .. py:method:: __getitem__(idx)

      Currently required by Hyrax machinery, but likely to be phased out.


   .. py:method:: __len__() -> int

      Return the number of records in the CSV.


   .. py:method:: sample_data()

      Return the first record, in dictionary form, as the sample.


.. py:class:: HyraxHATSDataset(config: dict, data_location: pathlib.Path = None)

   Bases: :py:obj:`hyrax.datasets.dataset_registry.HyraxDataset`


   Generic Hyrax dataset for HATS catalogs loaded through LSDB.

   .. rubric:: Notes

   This phase-1 implementation materializes the LSDB catalog to a pandas
   DataFrame and dynamically creates ``get_<column>`` methods for requested columns.

   .. py:method:: __init__


   .. py:attribute:: data_location
      :value: None


   .. py:attribute:: dataframe


   .. py:attribute:: column_names


   .. py:method:: _requested_columns_from_config(config: dict) -> list[str]


   .. py:method:: _open_catalog_kwargs_from_config(config: dict) -> dict


   .. py:method:: __len__() -> int


.. py:class:: MultimodalUniverseDataset(config: dict, data_location: pathlib.Path | str | None = None)

   Bases: :py:obj:`hyrax.datasets.dataset_registry.HyraxDataset`


   Load a MultimodalUniverse dataset through Hugging Face ``datasets``.

   This dataset class is intentionally generic so one configuration pattern can
   be used for image, spectra, and time-series MMU datasets.

   .. rubric:: Examples

   Example ``data_request`` configuration::

       {
           "infer": {
               "mmu": {
                   "dataset_class": "MultimodalUniverseDataset",
                   "data_location": "hf://MultimodalUniverse/plasticc",
                   "primary_id_field": "object_id",
                   "dataset_config": {
                       "MultimodalUniverseDataset": {
                           "split": "train",
                           "max_samples": 32,
                       }
                   },
               }
           }
       }

   .. py:method:: __init__


   .. py:attribute:: data_location
      :value: ''


   .. py:attribute:: split


   .. py:attribute:: max_samples


   .. py:attribute:: streaming


   .. py:attribute:: dataset


   .. py:attribute:: _column_name_map


   .. py:method:: _normalize_data_location(data_location: str) -> str


   .. py:method:: _load_dataset(dataset_source: str)


   .. py:method:: _limit_non_streaming_dataset(dataset: Any, max_samples: int)


   .. py:method:: _build_column_name_map() -> dict[str, str]

      Returns a map from sanitized column names to the original column names.

      It's possible for a column name to have punctuation or start with a number.
      In these cases we also allow column access via a sanitized name where all
      punctuation is replaced with the underscore character, and any field starting
      with a number is replaced by ``field_``

      Every field is entered in the dictionary regardless of whether it needed
      sanitization. For a field that needs none, the sanitized name is exactly the field
      name.
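
      A small sketch of such a map is given below. It assumes that "punctuation" means any
      non-word character and that a leading digit is handled by prefixing ``field_``; both are
      interpretations of the description above rather than confirmed behavior, and the function
      names are hypothetical.

      .. code-block:: python

          import re


          def sanitize_name(column_name):
              """Replace non-identifier characters and guard a leading digit."""
              sanitized = re.sub(r"\W", "_", column_name)
              # ASSUMPTION: a leading digit is handled by prefixing ``field_``.
              if sanitized[:1].isdigit():
                  sanitized = f"field_{sanitized}"
              return sanitized


          def build_column_name_map(column_names):
              """Map each sanitized name back to the original column name."""
              return {sanitize_name(name): name for name in column_names}


          name_map = build_column_name_map(["flux", "flux.err", "2mass_id"])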


   .. py:method:: _sanitize_name(column_name: str) -> str

      Take a column name that may contain punctuation and return a version with
      each punctuation character replaced by an underscore.


   .. py:method:: _register_getters() -> None


   .. py:method:: __len__() -> int


.. py:class:: NestedPandasDataset(config: dict, data_location: pathlib.Path | str | None = None)

   Bases: :py:obj:`hyrax.datasets.dataset_registry.HyraxDataset`


   A minimal Hyrax wrapper around ``nested_pandas.read_parquet``.

   .. py:method:: __init__


   .. py:attribute:: data_location
      :value: ''


   .. py:attribute:: read_kwargs


   .. py:attribute:: nested_frame


   .. py:method:: _load_nested_frame(read_kwargs: dict)


   .. py:method:: _all_available_fields() -> list[str]


   .. py:method:: _register_getters() -> None


   .. py:method:: __len__() -> int


.. py:class:: LanceDBDataset(config: dict, data_location: pathlib.Path | str | None = None)

   Bases: :py:obj:`hyrax.datasets.dataset_registry.HyraxDataset`


   A minimal Hyrax wrapper around a LanceDB table.

   .. py:method:: __init__

   Overall initialization for all datasets, which saves the config.

   Subclasses of HyraxDataset should call this at the end of their ``__init__``, like so:

   .. code-block:: python

       from hyrax.datasets import HyraxDataset

       class MyDataset(HyraxDataset):
           def __init__(self, config):
               # <your code>
               super().__init__(config)

   If per-tensor metadata is available, it is recommended that dataset authors create an
   astropy Table of that data, in the same order as their data, and pass that ``metadata_table``
   as shown below:

   .. code-block:: python

       from hyrax.datasets import HyraxDataset
       from astropy.table import Table

       class MyDataset(HyraxDataset):
           def __init__(self, config):
               # <your code>
               metadata_table = Table(...)  # your catalog data goes here
               super().__init__(config, metadata_table)

   :param config: The runtime configuration for hyrax
   :type config: dict, Optional
   :param metadata_table: An Astropy Table that (1) contains the metadata columns desired for
                          visualization and (2) has its rows in the order your data will be enumerated.
   :type metadata_table: Optional[Table], optional
   :param object_id_column_name: The name of the column containing object IDs. If None, uses the default
                                 from config or creates one from the ids() method.
   :type object_id_column_name: Optional[str], optional


   .. py:attribute:: data_location
      :value: ''


   .. py:attribute:: table_name


   .. py:attribute:: connect_kwargs


   .. py:attribute:: open_table_kwargs


   .. py:attribute:: db


   .. py:attribute:: table


   .. py:attribute:: lance_dataset


   .. py:attribute:: _row_cache
      :type:  collections.OrderedDict


   .. py:method:: _all_available_fields() -> list[str]


   .. py:method:: _get_row(idx: int)

      Return the PyArrow record-batch for *idx*, using a small FIFO row cache.

      Caching avoids redundant ``lance_dataset.take`` calls when multiple
      ``get_<field>`` accessors are invoked for the same sample index, which is
      the common pattern when DataProvider resolves all fields for a single item.
      The cache holds at most ``_ROW_CACHE_SIZE`` rows; the oldest entry is
      evicted once that limit is reached.
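
      The eviction policy described above can be illustrated with a minimal, self-contained
      sketch. It is not the Hyrax implementation: ``FifoRowCache``, ``load_row``, and the
      limit of 4 are hypothetical stand-ins for the real ``_row_cache``,
      ``lance_dataset.take``, and ``_ROW_CACHE_SIZE``.

      .. code-block:: python

          from collections import OrderedDict

          ROW_CACHE_SIZE = 4  # hypothetical limit, standing in for _ROW_CACHE_SIZE


          class FifoRowCache:
              """Toy FIFO cache keyed by sample index (illustrative only)."""

              def __init__(self, load_row, max_size=ROW_CACHE_SIZE):
                  self._load_row = load_row  # stand-in for the expensive row lookup
                  self._max_size = max_size
                  self._rows = OrderedDict()
                  self.loads = 0  # counts how often the expensive lookup runs

              def get_row(self, idx):
                  if idx in self._rows:
                      # A hit does not reorder entries, so eviction stays FIFO.
                      return self._rows[idx]
                  if len(self._rows) >= self._max_size:
                      self._rows.popitem(last=False)  # drop the oldest inserted row
                  self.loads += 1
                  row = self._load_row(idx)
                  self._rows[idx] = row
                  return row


          cache = FifoRowCache(load_row=lambda idx: {"idx": idx})
          first = cache.get_row(7)
          second = cache.get_row(7)  # served from the cache, no second lookup

      Because hits leave the insertion order untouched, this behaves as FIFO rather than
      LRU: a frequently read row is still evicted once enough newer rows have been loaded.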


   .. py:method:: _resolve_table_name(configured_table_name) -> str


   .. py:method:: _register_getters() -> None


   .. py:method:: __len__() -> int


.. py:class:: DataCache(config, data_provider: hyrax.datasets.data_provider.DataProvider)

   DataCache tracks and manages a caching layer which can be used most effectively if the entirety of a
   training (or inference) epoch fits in system RAM.

   Two configs control this functionality:

   ``h.config["data_set"]["use_cache"]`` determines whether data dictionaries are served out of a cache.
   When set, the first epoch of training fills the cache with tensors, and subsequent epochs are served out
   of the cache.

   ``h.config["data_set"]["preload_cache"]`` starts a thread which iterates over the dataset/dataloader class
   to completion. The thread pre-loads the cache with tensors independently of the training process. The
   intent is for this thread to run ahead of the first epoch of training, so that the first epoch is
   sped up as well.

   This class caches the output of DataProvider before it is batched. Users can control the size of
   the cached data by selecting only particular fields in their ``data_request`` specification.

   The class logs to the tensorboard logger in the DataProvider (when configured).

   Initialize the DataCache with a Hyrax config.

   :param config: The Hyrax configuration that defines the data_request.
   :type config: dict
   :param data_provider: The DataProvider object which we are caching for.
   :type data_provider: DataProvider


   .. py:attribute:: _max_length


   .. py:attribute:: _resolve_data_func


   .. py:attribute:: _data_provider


   .. py:attribute:: _use_cache


   .. py:attribute:: _preload_cache


   .. py:attribute:: _data_size_bytes
      :value: 0


   .. py:attribute:: _insert_count
      :value: 0


   .. py:attribute:: logging_interval
      :value: 1000


   .. py:attribute:: _cache_map


   .. py:attribute:: _preload_thread
      :value: None


   .. py:method:: start_preload_thread()

      Start the cache preload thread, if configured.

      This exists to separate initialization from thread start in DataProvider's
      constructor, so that the started thread can always rely on a fully initialized DataProvider.
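
      A minimal sketch of this two-step pattern follows. The ``Provider`` and
      ``PreloadingCache`` classes are hypothetical stand-ins, not the Hyrax classes; they
      only show why the thread is started by a separate call after the owner has finished
      its own setup.

      .. code-block:: python

          import threading


          class PreloadingCache:
              """Illustrative cache whose background thread is started explicitly."""

              def __init__(self, provider, preload=True):
                  self._provider = provider
                  self._preload = preload
                  self._cache_map = {}
                  self._preload_thread = None  # nothing runs until asked to

              def start_preload_thread(self):
                  if self._preload and self._preload_thread is None:
                      self._preload_thread = threading.Thread(target=self._fill, daemon=True)
                      self._preload_thread.start()

              def _fill(self):
                  # Reads provider state, so the provider must be fully built by now.
                  for idx, item in enumerate(self._provider.items):
                      self._cache_map[idx] = item


          class Provider:
              """Illustrative owner that creates the cache part-way through its setup."""

              def __init__(self, n_items):
                  self.cache = PreloadingCache(self)  # self.items does not exist yet
                  self.items = list(range(n_items))
                  self.cache.start_preload_thread()   # safe: setup is complete


          provider = Provider(5)
          provider.cache._preload_thread.join()

      Had the thread been started inside ``PreloadingCache.__init__``, ``_fill`` could run
      before ``items`` was assigned and fail with an ``AttributeError``.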


   .. py:method:: _idx_check(idx)


   .. py:method:: try_fetch(idx: int) -> dict | None

      Try to fetch a data_dict from the cache.

      :param idx: The DataProvider index of the data dict
      :type idx: int

      :returns: The data dict from the cache, None on a cache miss.
      :rtype: Optional[dict]


   .. py:method:: insert_into_cache(idx: int, data: dict[str, dict[str, Any]])

      Insert a data dict into the cache

      :param idx: Index of the data dict
      :type idx: int
      :param data: The data dict
      :type data: dict[str, dict[str, Any]]
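
      As a rough illustration of how ``try_fetch`` and ``insert_into_cache`` fit together,
      the sketch below uses a hypothetical ``DictCache`` stand-in with the same two method
      names. The ``resolve`` and ``get_item`` helpers are assumptions made for the example
      and do not reflect how DataProvider is actually written.

      .. code-block:: python

          class DictCache:
              """Hypothetical stand-in exposing the same two methods as DataCache."""

              def __init__(self):
                  self._cache_map = {}

              def try_fetch(self, idx):
                  return self._cache_map.get(idx)  # None signals a cache miss

              def insert_into_cache(self, idx, data):
                  self._cache_map[idx] = data


          def resolve(idx):
              """Stand-in for the expensive per-item load."""
              return {"data": {"image": [idx] * 3, "object_id": str(idx)}}


          def get_item(cache, idx):
              data = cache.try_fetch(idx)
              if data is None:  # miss: load once, then remember the result
                  data = resolve(idx)
                  cache.insert_into_cache(idx, data)
              return data


          cache = DictCache()
          epoch_one = [get_item(cache, i) for i in range(3)]  # fills the cache
          epoch_two = [get_item(cache, i) for i in range(3)]  # served from the cache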


   .. py:method:: _data_size(data, seen: set[int] | None = None) -> int
      :staticmethod:


   .. py:method:: _preload_tensor_cache()

      Preload all tensors in the dataset using multiple threads.


   .. py:method:: _lazy_map_executor(executor: concurrent.futures.Executor, idxs: collections.abc.Iterable[int])

      Lazy evaluation version of concurrent.futures.Executor.map().

      This limits memory usage during preloading by keeping only a small
      number of data dictionaries in memory at once.

      :param executor: An executor for running futures
      :type executor: concurrent.futures.Executor
      :param idxs: An iterable of DataProvider indexes
      :type idxs: Iterable[int]

      :Yields: *Iterator[torch.Tensor]* -- An iterator over torch tensors, lazily loaded
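
      The general technique can be sketched as follows. This is an illustration under
      stated assumptions, not the method's actual body: the ``lazy_map`` name, the ``fn``
      argument, and the ``max_in_flight`` limit are all hypothetical.

      .. code-block:: python

          from collections import deque
          from concurrent.futures import ThreadPoolExecutor
          from itertools import islice


          def lazy_map(executor, fn, idxs, max_in_flight=4):
              """Yield fn(idx) in order, keeping at most max_in_flight futures pending."""
              idxs = iter(idxs)
              # Prime the window with the first few submissions only.
              pending = deque(executor.submit(fn, i) for i in islice(idxs, max_in_flight))
              while pending:
                  result = pending.popleft().result()
                  # Refill the window with one more index, if any remain.
                  for i in islice(idxs, 1):
                      pending.append(executor.submit(fn, i))
                  yield result


          with ThreadPoolExecutor(max_workers=2) as pool:
              squares = list(lazy_map(pool, lambda i: i * i, range(10)))

      By default, ``Executor.map`` collects its input iterables and submits every task up
      front, so all results may be held in memory at once. The bounded window above trades
      a little throughput for a predictable memory ceiling.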