hyrax.data_sets
===============

.. py:module:: hyrax.data_sets

.. autoapi-nested-parse::

   Hyrax has several built-in datasets that you can use for astronomical data. For many uses, these datasets
   can be configured out of the box for a given project.

   :doc:`FitsImageDataSet <fits_image_dataset/index>` is a generic container for FITS image cutout data
   indexed by a user-provided catalog file. It attempts to cover common usage paradigms such as multiple images
   of the same object differentiated by telescope filter; however, extending the class as a custom dataset
   may be a better fit for advanced usage.

   :doc:`LSSTDataset <lsst_dataset/index>` is an alpha-quality container for LSST cutout images, currently
   limited to ``deep_coadd`` type images, and restricted to run only on a Rubin Observatory RSP environment
   where `LSST Pipeline <https://pipelines.lsst.io/>`_ tools and a
   `data butler <https://pipelines.lsst.io/modules/lsst.daf.butler/index.html>`_ with the appropriate images
   are available.

   :doc:`DownloadedLSSTDataset <downloaded_lsst_dataset/index>` is a subclass of LSSTDataset that generates
   cutouts from the butler and saves them as ``.pt`` files on first access. On subsequent access,
   it loads the cutouts directly from these files, which can significantly speed up data loading times.
   It inherits from LSSTDataset to access the data butler and catalog functionality.

   :doc:`HSCDataSet <hsc_data_set/index>` works similarly to FitsImageDataSet, but is specialized to
   `Hyper Suprime-Cam (HSC) <https://hsc-release.mtk.nao.ac.jp/doc/index.php/data/>`_ cutout images downloaded
   with the hyrax ``download`` verb. It contains additional integrity checks and is tightly integrated with
   the ``download`` and ``rebuild_manifest`` verbs. In future this class and the downloader may become a
   separate package.

   :doc:`HyraxCifarDataset <hyrax_cifar_dataset/index>` and
   :doc:`HyraxCifarIterableDataset <hyrax_cifar_dataset/index>` give access to the standard
   `CIFAR10 <https://www.cs.toronto.edu/~kriz/cifar.html>`_ labeled image dataset, automatically downloading the
   dataset if it is not present. These datasets are useful for testing hyrax and occasionally individual models,
   but they are not astronomical datasets.

   :doc:`HyraxRandomDataset <random/hyrax_random_dataset/index>` and
   :doc:`HyraxRandomIterableDataset <random/hyrax_random_dataset/index>` are utility datasets that
   generate random data with a specific shape.
   These datasets make it easy to test new models with simple random data.
   They are highly configurable such that it's possible to simulate input data for models that
   are under development.

   Each of these datasets can be used as a starting point for a custom dataset by inheriting your custom dataset
   from e.g. ``FitsImageDataSet``, or you can make an entirely custom dataset following the
   :ref:`custom dataset instructions <custom-dataset-instructions>` and/or
   :doc:`custom dataset example notebook </pre_executed/custom_dataset>`.

   The remaining classes in this module exist primarily for Hyrax interface purposes:

   :doc:`InferenceDataSet <inference_dataset/index>` is a dataset class that represents an ``infer`` or ``umap``
   result, and may be returned from those verbs to provide data access.

   :doc:`HyraxDataset <data_set_registry/index>` is a base class for all datasets in Hyrax and must be within
   the inheritance hierarchy of all custom datasets. It is not usable on its own, but provides various fall-back
   functionality to make custom datasets easier to write. See the
   :ref:`custom dataset instructions <custom-dataset-instructions>` and
   :doc:`example notebook </pre_executed/custom_dataset>` for more information.


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/hyrax/data_sets/data_provider/index
   /autoapi/hyrax/data_sets/data_set_registry/index
   /autoapi/hyrax/data_sets/downloaded_lsst_dataset/index
   /autoapi/hyrax/data_sets/fits_image_dataset/index
   /autoapi/hyrax/data_sets/hsc_data_set/index
   /autoapi/hyrax/data_sets/hyrax_cifar_dataset/index
   /autoapi/hyrax/data_sets/hyrax_csv_dataset/index
   /autoapi/hyrax/data_sets/inference_dataset/index
   /autoapi/hyrax/data_sets/lsst_dataset/index
   /autoapi/hyrax/data_sets/random/index
   /autoapi/hyrax/data_sets/tensor_cache_mixin/index


Classes
-------

.. autoapisummary::

   hyrax.data_sets.FitsImageDataSet
   hyrax.data_sets.LSSTDataset
   hyrax.data_sets.DownloadedLSSTDataset
   hyrax.data_sets.HSCDataSet
   hyrax.data_sets.HyraxCifarDataset
   hyrax.data_sets.HyraxCifarIterableDataset
   hyrax.data_sets.HyraxRandomDataset
   hyrax.data_sets.HyraxRandomIterableDataset
   hyrax.data_sets.HyraxRandomDatasetBase
   hyrax.data_sets.InferenceDataSet
   hyrax.data_sets.HyraxDataset
   hyrax.data_sets.HyraxCifarBase
   hyrax.data_sets.HyraxCSVDataset


Functions
---------

.. autoapisummary::

   hyrax.data_sets.iterable_dataset_collate


Package Contents
----------------

.. py:class:: FitsImageDataSet(config: dict, data_location=None)

   Bases: :py:obj:`hyrax.data_sets.data_set_registry.HyraxDataset`, :py:obj:`hyrax.data_sets.data_set_registry.HyraxImageDataset`, :py:obj:`hyrax.data_sets.tensor_cache_mixin.TensorCacheMixin`, :py:obj:`torch.utils.data.Dataset`


   Dataset for Fits Images, typically cutouts.

   .. py:method:: __init__

   Initialize a FitsImageDataSet

   Most work is done in ``_init_from_path`` and functions it calls in order to allow
   subclasses to override behavior.

   :param config: Nested configuration dictionary for hyrax
   :type config: dict
   :param data_location: The directory location of the data that this dataset class will access
   :type data_location: Optional[Union[Path, str]]


   .. py:attribute:: _called_from_test
      :value: False


   .. py:attribute:: _config


   .. py:attribute:: object_id_column_name


   .. py:attribute:: filter_column_name


   .. py:attribute:: filename_column_name


   .. py:method:: _init_from_path(path: Union[pathlib.Path, str])

      Helper for ``__init__``. Initialize the data set from a path. This involves several filesystem scan
      operations and will ultimately open and read the header info of every FITS file in the given directory.

      :param path: Path or string specifying the directory path that is the root of all filenames in the
                   catalog table
      :type path: Union[Path, str]


   .. py:method:: _set_crop_transform()

      Returns the crop transform on the image

      If overridden, the subclass must:

      1) Set self.cutout_shape to a tuple of ints representing the size of the cutouts that will be
         returned at some point in the init flow.
      2) Update the crop transform using self.set_crop_transform() from the HyraxImageDataset mixin.


   .. py:method:: _read_filter_catalog(filter_catalog_path: pathlib.Path | None)


   .. py:method:: _parse_filter_catalog(table) -> None

      Sets self.files by parsing the catalog.

      Subclasses may override this function to control parsing of the table more directly, but the
      overriding class must create the files dict which has type dict[object_id -> dict[filter -> filename]]
      with object_id, filter, and filename all strings.  In the case of no filter distinction, a single
      flag value may be used for the filter dict keys in the inner dicts.

      :param table: The catalog we read in
      :type table: Table
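
      A minimal sketch of the ``files`` structure described above, assuming the catalog is available as
      plain row dicts. The column names, the ``"all"`` flag value, and the ``parse_rows`` helper are
      hypothetical and are not part of Hyrax.

      .. code-block:: python

         def parse_rows(rows, default_filter="all"):
             """Build dict[object_id -> dict[filter -> filename]] with string keys and values."""
             files = {}
             for row in rows:
                 object_id = str(row["object_id"])
                 # With no filter column, a single flag value serves as the inner dict key.
                 filter_name = str(row.get("filter", default_filter))
                 files.setdefault(object_id, {})[filter_name] = str(row["filename"])
             return files

         rows = [
             {"object_id": 1, "filter": "g", "filename": "1_g.fits"},
             {"object_id": 1, "filter": "r", "filename": "1_r.fits"},
             {"object_id": 2, "filename": "2.fits"},
         ]
         files = parse_rows(rows)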


   .. py:method:: _before_preload() -> None


   .. py:method:: _prepare_metadata()


   .. py:method:: shape() -> tuple[int, int, int]

      Shape of the individual cutouts this will give to a model

      :returns: Tuple describing the dimensions of the 3 dimensional tensor handed back to models
                The first index is the number of filters
                The second index is the width of each image
                The third index is the height of each image
      :rtype: tuple[int,int,int]


   .. py:method:: __len__() -> int

      Returns number of objects in this loader

      :returns: number of objects in this data loader
      :rtype: int


   .. py:method:: get_object_id(idx: int) -> str

      Get the object ID at the given index

      :param idx: Index of the object ID to return
      :type idx: int

      :returns: The object ID at the given index
      :rtype: str


   .. py:method:: get_image(idx: int)

      Get the image at the given index as a PyTorch Tensor.

      :param idx: Index of the image to return
      :type idx: int

      :returns: The image at the given index as a PyTorch Tensor.
      :rtype: torch.Tensor


   .. py:method:: __getitem__(idx: int)


   .. py:method:: __contains__(object_id: str) -> bool

      Allows you to do `object_id in dataset` queries. Used by testing code.

      :param object_id: The object ID you'd like to know if is in the dataset
      :type object_id: str

      :returns: True if the object_id given is in the data set
      :rtype: bool


   .. py:method:: _get_file(index: int) -> pathlib.Path

      Private indexing method across all files.

      Returns the file path corresponding to the given index.

      The index is zero-based and defined in the same manner as the total order of _all_files() and
      _object_files() iterator. Useful if you have an np.array() or list built from _all_files() and you
      need to select an individual item.

      Only valid after self.object_ids, self.files, self.path, and self.num_filters have been initialized
      in __init__

      :param index: Index, see above for order semantics
      :type index: int

      :returns: The path to the file
      :rtype: Path
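
      The index semantics can be illustrated with a small sketch. It assumes every object has the same
      number of filters and that the order is objects first, then filters; ``all_files`` and ``get_file``
      here are hypothetical stand-ins, not the actual Hyrax methods.

      .. code-block:: python

         def all_files(files):
             """Yield filenames in one fixed order: by object, then by filter."""
             for by_filter in files.values():
                 for filename in by_filter.values():
                     yield filename

         def get_file(files, num_filters, index):
             """Return the entry at a zero-based position in the order of all_files()."""
             object_index, filter_index = divmod(index, num_filters)
             object_id = list(files)[object_index]
             filter_name = list(files[object_id])[filter_index]
             return files[object_id][filter_name]

         files = {
             "1": {"g": "1_g.fits", "r": "1_r.fits"},
             "2": {"g": "2_g.fits", "r": "2_r.fits"},
         }
         flat = list(all_files(files))
         third = get_file(files, num_filters=2, index=2)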


   .. py:method:: ids(log_every=None) -> collections.abc.Generator[str]

      Public read-only iterator over all object_ids that enforces a strict total order across
      objects. Will not work prior to self.files initialization in __init__

      :Yields: *Iterator[str]* -- Object IDs currently in the dataset


   .. py:method:: _all_files()

      Private read-only iterator over all files that enforces a strict total order across
      objects and filters. Will not work prior to self.files, and self.path initialization in __init__

      :Yields: *Path* -- The path to the file.


   .. py:method:: _filter_filename(object_id)

      Private read-only iterator over all files for a given object. This enforces a strict total order
      across filters. Will not work prior to self.files initialization in __init__

      :Yields: *filter_name, file name* -- The name of a filter and the file name for the fits file.
               The file name is relative to self.path


   .. py:method:: _object_files(object_id)

      Private read-only iterator over all files for a given object. This enforces a strict total order
      across filters. Will not work prior to self.files, and self.path initialization in __init__

      :Yields: *Path* -- The path to the file.


   .. py:method:: _file_to_path(filename: str) -> pathlib.Path

      Turns a filename into a full path suitable for open. Equivalent to:

      `Path(self.path) / Path(filename)`

      :param filename: The filename string
      :type filename: str

      :returns: A full path that is openable.
      :rtype: Path


   .. py:method:: _read_object_id(object_id: str)


   .. py:method:: _convert_to_torch(data: list[numpy.typing.ArrayLike])


   .. py:method:: _load_tensor_for_cache(object_id: str)

      Implementation of TensorCacheMixin abstract method.


   .. py:method:: _object_id_to_tensor(object_id: str)

      Converts an object_id to a pytorch tensor with dimensions (self.num_filters, self.cutout_shape[0],
      self.cutout_shape[1]). This is done by reading the file and slicing away any excess pixels at the
      far corners of the image from (0,0).

      The current implementation reads the files once the first time they are accessed, and then
      keeps them in a dict for future accesses.

      :param object_id: The object_id requested
      :type object_id: str

      :returns: A tensor with dimension (self.num_filters, self.cutout_shape[0], self.cutout_shape[1])
      :rtype: torch.Tensor
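
      The slicing step described above can be sketched with plain nested lists. The real method works
      on tensors; ``crop_from_origin`` is a hypothetical helper used only for illustration.

      .. code-block:: python

         def crop_from_origin(image, cutout_shape):
             """Keep the region that starts at (0, 0) and has size cutout_shape."""
             size_0, size_1 = cutout_shape
             # Rows beyond size_0 and columns beyond size_1 are sliced away.
             return [row[:size_1] for row in image[:size_0]]

         # A 4 x 5 image whose pixel value encodes its (row, column) position.
         image = [[r * 10 + c for c in range(5)] for r in range(4)]
         cropped = crop_from_origin(image, (3, 3))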


.. py:class:: LSSTDataset(config, data_location=None)

   Bases: :py:obj:`hyrax.data_sets.data_set_registry.HyraxDataset`, :py:obj:`hyrax.data_sets.data_set_registry.HyraxImageDataset`, :py:obj:`torch.utils.data.Dataset`


   LSSTDataset: A dataset to access deep_coadd images from the LSST pipelines
   via the butler. Must be run in an RSP.

   .. py:method:: __init__

   Initialize the dataset with either a HATS catalog or astropy table.

   Config can specify either:

   - config["data_set"]["hats_catalog"]: path to HATS catalog
   - config["data_set"]["astropy_table"]: path to any file readable by Astropy Table


   .. py:attribute:: BANDS
      :value: ['u', 'g', 'r', 'i', 'z', 'y']


   .. py:attribute:: object_id_autodetect_names
      :value: ['object_id', 'objectId']


   .. py:attribute:: catalog


   .. py:attribute:: sh_deg


   .. py:attribute:: sw_deg


   .. py:attribute:: oid_column_name


   .. py:method:: _butler_available()


   .. py:method:: _get_butler_thread_safe()

      Thread safe butler creation

      This function ensures that there is one and only one butler created per thread
      and that threads always use their assigned butler.

      This is necessary because child classes of this one use butlers, and butler
      objects are not safe for multithreaded access.

      :returns: The butler assigned to the current thread.
      :rtype: butler


   .. py:method:: _detect_object_id_column_name()

      Setup file naming strategy based on catalog columns.


   .. py:method:: _load_catalog(data_set_config)

      Load the catalog from either a HATS catalog or an astropy table.


   .. py:method:: _load_hats_catalog(hats_path)

      Load catalog from HATS format using LSDB.


   .. py:method:: _load_astropy_catalog(table_path)

      Load catalog from astropy table format or pickled astropy table.


   .. py:method:: __len__()


   .. py:method:: get_image(idxs)

      Get image cutouts for the given indices.

      :param idxs: The index or indices of the cutouts to retrieve.
      :type idxs: int or list of int

      :returns: Single cutout tensor or list of cutout tensors.
      :rtype: list or torch.Tensor


   .. py:method:: __getitem__(idxs)

      Get default data fields for this dataset.

      :param idxs: The index or indices of the cutouts to retrieve.
      :type idxs: int or list of int

      :returns: A dictionary containing the default data fields.
      :rtype: dict


   .. py:method:: _parse_box(patch, row)

      Return a Box2I representing the desired cutout in pixel space, given a "row" of catalog data
      which includes the semi-height (sh) and semi-width (sw) in degrees desired for the cutout.


   .. py:method:: _parse_sphere_point(row)

      Return a SpherePoint with the RA and dec given in the "row" of catalog data.
      Row must include the RA and dec as "ra" and "dec" columns respectively.


   .. py:method:: _get_tract_patch(row)

      Return (tractInfo, patchInfo) for a given row.

      This function only returns the single principal tract and patch in the case of overlap.


   .. py:method:: _request_patch(tract_index, patch_index)

      Request a patch from the butler. This will be a list of
      lsst.afw.image objects each corresponding to the configured
      bands

      Uses functools.lru_cache for basic in-memory caching.
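
      As an illustration of the caching behavior, the sketch below wraps a hypothetical
      ``request_patch`` function with ``functools.lru_cache``. The function body and the ``maxsize``
      value are assumptions and do not reflect the real butler request.

      .. code-block:: python

         import functools

         calls = []

         @functools.lru_cache(maxsize=128)
         def request_patch(tract_index, patch_index):
             """Hypothetical stand-in for an expensive butler request."""
             calls.append((tract_index, patch_index))
             return f"patch-{tract_index}-{patch_index}"

         request_patch(1, 2)
         request_patch(1, 2)  # Same arguments: served from the cache.
         request_patch(1, 3)  # New arguments: the function body runs again.
         info = request_patch.cache_info()

      Calling ``request_patch.cache_clear()`` empties the cache and frees the memory it holds.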


   .. py:method:: _fetch_single_cutout(row)

      Make a single cutout, returning a torch tensor.

      Does not handle edge-of-tract/patch type edge cases, will only work near
      center of a patch.


.. py:class:: DownloadedLSSTDataset(config, data_location)

   Bases: :py:obj:`hyrax.data_sets.lsst_dataset.LSSTDataset`, :py:obj:`hyrax.data_sets.tensor_cache_mixin.TensorCacheMixin`


   DownloadedLSSTDataset: A dataset that inherits from LSSTDataset and downloads
   cutouts from the LSST butler, saving them as `.pt` files during first access.
   On subsequent accesses, it loads cutouts directly from these cached files.

   This class also creates a manifest file with the shape of each cutout and the
   corresponding filename.

   Public Methods:
       download_cutouts(indices=None, sync_filesystem=True, max_workers=None, force_retry=False):
           Download cutouts with parallel processing. Automatically resumes from
           previous progress. Use max_workers to control thread count, force_retry
           to re-attempt failed downloads.

       manifest_stats():
           Returns dict with download statistics: total, successful, failed, pending
           counts and manifest file path.

       download_progress():
           Returns detailed progress metrics including completion percentage and
           failure rates.

       reset_failed_downloads():
           Resets all failed download attempts to allow retry without force_retry flag.
           Returns count of reset entries.

       save_manifest_now():
           Forces immediate manifest save (normally saved periodically during downloads).

       cache_info():
           Returns LRU cache statistics for patch fetching performance monitoring.

       clear_cache():
           Clears the patch LRU cache to free memory.

   Usage Example::

       import hyrax

       # Initialize Hyrax
       h = hyrax.Hyrax()
       a = h.prepare()

       # Download all cutouts (resumes automatically)
       a.download_cutouts(max_workers=4)

       # Check progress
       a.download_progress()

       # Retry failed downloads
       a.download_cutouts(force_retry=True)

       # Access cutouts (loads from cache)
       cutout = a[0]  # Single cutout
       cutouts = a[0:10]  # Multiple cutouts

   WARNING: The LRU caching scheme is slightly complicated, so it is recommended to
   use the default max_workers=1 for the first download. Simply using more workers
   may not always speed up the download process.

   File Organization:

   - Cutouts saved as: cutout_{object_id}.pt or cutout_{index:04d}.pt
   - Manifest saved as: manifest.fits (Astropy) or manifest.parquet (HATS)
   - All files stored in config["general"]["data_dir"]

   .. py:method:: __init__

   Initialize the dataset with either a HATS catalog or astropy table.

   Config can specify either:

   - config["data_set"]["hats_catalog"]: path to HATS catalog
   - config["data_set"]["astropy_table"]: path to any file readable by Astropy Table


   .. py:attribute:: download_dir


   .. py:attribute:: catalog_object_ids


   .. py:attribute:: _manifest_lock


   .. py:attribute:: _updates_since_save
      :value: 0


   .. py:attribute:: _save_interval
      :value: 1000


   .. py:attribute:: _band_failure_stats


   .. py:attribute:: _band_failure_lock


   .. py:attribute:: _manifest_filter_object_ids
      :value: None


   .. py:attribute:: _catalog_to_manifest_index_map
      :value: None


   .. py:attribute:: _manifest_to_catalog_index_map
      :value: None


   .. py:method:: get_objectId(idx)

      Get object ID for a given index based on naming strategy.


   .. py:method:: ids(log_every=None)

      Generator yielding object IDs for the entire dataset. Required by TensorCacheMixin


   .. py:method:: _setup_naming_strategy()

      Setup file naming strategy based on catalog columns.


   .. py:method:: _initialize_manifest()

      Create new manifest or load/merge with existing manifest, with band filtering validation.

      The manifest is always an astropy Table with at least the following columns:

      - cutout_shape: np.array of dimensions e.g. [3,150,150]
      - filename: string containing the fits filename containing the tensor for the object
      - downloaded_bands: string containing a comma separated list of the bands downloaded.
        Order is expected to be consistent between rows.

      When this astropy table is loaded into memory, multiple sources are consulted:

      - The manifest on the filesystem, which contains the source of truth for what
        files have been downloaded. If this is not found, it is created.
      - The bands given in the catalog passed in


   .. py:method:: _load_existing_manifest()

      Load existing manifest file.


   .. py:method:: _update_manifest_from_catalog(existing_manifest)

      Using object_id as a unique key, adds manifest entries to existing_manifest,
      using self.catalog as the source of any new objects.

      self.catalog is not altered by this operation.

      Entries in existing_manifest are not altered by this operation.
      New entries are added to the end of existing_manifest with a state indicating
      they have not been downloaded.
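
      The merge rules above can be sketched with plain lists of dicts. This is an illustration only:
      it returns a new list for simplicity, it uses an empty filename to mean "not downloaded", and
      the helper name is hypothetical.

      .. code-block:: python

         def update_manifest_from_catalog(existing_manifest, catalog_ids):
             """Keep existing rows as they are and append rows for unseen object IDs."""
             known = {row["object_id"] for row in existing_manifest}
             merged = list(existing_manifest)
             for object_id in catalog_ids:
                 if object_id not in known:
                     # New objects go at the end, marked as not yet downloaded.
                     merged.append({"object_id": object_id, "filename": ""})
                     known.add(object_id)
             return merged

         manifest = [{"object_id": "a", "filename": "cutout_a.pt"}]
         merged = update_manifest_from_catalog(manifest, ["b", "a", "c"])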


   .. py:method:: _build_catalog_to_manifest_index_map()

      Build efficient mapping from catalog indices to manifest indices.
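
      A rough sketch of such a mapping, assuming object IDs are the shared key between the catalog
      and the manifest. ``build_index_map`` is a hypothetical helper, not the Hyrax implementation.

      .. code-block:: python

         def build_index_map(catalog_ids, manifest_ids):
             """Map each catalog index to the manifest index of the same object ID, or None."""
             # One pass over the manifest avoids a linear search per catalog row.
             position = {object_id: i for i, object_id in enumerate(manifest_ids)}
             return {i: position.get(object_id) for i, object_id in enumerate(catalog_ids)}

         index_map = build_index_map(["b", "c", "z"], ["a", "b", "c"])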


   .. py:method:: _add_manifest_columns_to_table(table)

      Add cutout_shape, filename, and downloaded_bands columns to manifest.


   .. py:method:: _longest_object_id_idx()


   .. py:method:: _get_available_bands_from_manifest(manifest)

      Best effort to get available bands by looking at first 10 successful downloads for consistency.


   .. py:method:: _setup_band_filtering(requested_bands, original_band_order)

      Setup band filtering to extract only requested bands from cached cutouts.


   .. py:method:: _get_cutout_path_from_idx(idx)

      Generate cutout file path for a given index.

      This simply applies a pattern to the filename using the object_id column.
      No guarantees are made about the file itself.


   .. py:method:: _get_cutout_path_from_manifest(idx)

      Get the cutout path by consulting the manifest

      The download thread ensures that the filename is not written to the manifest
      until all the bands that we intend to download are downloaded.

      This function is intended to be a thread safe way to get valid cutout paths.
      In the case where the file exists and is believed to be correctly downloaded
      you get a filename, but this will return None if there is some other issue.

      :param idx: The catalog index of the relevant cutout
      :type idx: int

      :returns: path to the cutout.
      :rtype: Path


   .. py:method:: _update_manifest_entry(idx, cutout_shape=None, filename='Attempted', downloaded_bands=None)

      Thread-safe manifest update with periodic saves.

      :param idx: Index in the manifest
      :param cutout_shape: Shape tuple of the cutout tensor, or None for failed downloads
      :param filename: Basename of the saved file, or "Attempted" only when ALL bands fail
      :param downloaded_bands: List of band names successfully downloaded in tensor order
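
      The combination of a lock and periodic saves can be sketched as follows. ``ManifestSketch`` is a
      hypothetical in-memory class; the save interval of 2 is chosen only to keep the example short,
      and the save step just counts calls instead of writing a file.

      .. code-block:: python

         import threading

         class ManifestSketch:
             """Hypothetical manifest with lock-protected updates and periodic saves."""

             def __init__(self, size, save_interval=2):
                 self.rows = [None] * size
                 self.save_interval = save_interval
                 self.updates_since_save = 0
                 self.save_count = 0
                 self._lock = threading.Lock()

             def update_entry(self, idx, filename="Attempted"):
                 # The lock guards the row, the counter, and the save together.
                 with self._lock:
                     self.rows[idx] = filename
                     self.updates_since_save += 1
                     if self.updates_since_save >= self.save_interval:
                         self._save()

             def _save(self):
                 # A real implementation would write the table to disk here.
                 self.save_count += 1
                 self.updates_since_save = 0

         manifest = ManifestSketch(size=5)
         threads = [
             threading.Thread(target=manifest.update_entry, args=(i, f"cutout_{i}.pt"))
             for i in range(4)
         ]
         for thread in threads:
             thread.start()
         for thread in threads:
             thread.join()
         manifest.update_entry(4)  # Uses the default "Attempted" marker.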


   .. py:method:: _save_manifest()

      Save manifest


   .. py:method:: _sync_manifest_with_filesystem()

      Sync manifest with actual downloaded files on disk.

      This updates the manifest to reflect what is on the filesystem.
      For existing cutouts this loads every file using `torch.load`


   .. py:method:: _request_patch_cached(tract_index, patch_index, butler, skymap_name, bands_tuple)
      :staticmethod:


      Cached patch fetching using static method.

      Static method means no 'self' in cache key, making it truly global.
      Thread-safe because each call creates its own Butler instance.


   .. py:method:: _fetch_single_cutout(row, idx=None, manifest_idx=None)

      Fetch cutout, using saved cutout if available, with optional band filtering.


   .. py:method:: _fetch_cutout_with_cache(row)

      Generate cutout using cached patch fetching with NaN filling for failed bands.


   .. py:method:: _load_tensor_for_cache(object_id: str)

      Implementation of TensorCacheMixin abstract method.


   .. py:method:: __len__()

      Return length of current catalog, not the full manifest.


   .. py:method:: _get_manifest_index_for_catalog_index(catalog_idx)

      Map catalog index to manifest index. None return indicates no such item in manifest.


   .. py:method:: get_image(idxs)

      Fetch image cutout(s) for given index or indices, using caching and band filtering.

      :param idxs: Index or indices to fetch.
      :type idxs: int or slice or list

      :returns: Single cutout tensor or list of cutout tensors.
      :rtype: torch.Tensor or list of torch.Tensor


   .. py:method:: __getitem__(idxs) -> dict

      Modified to pass index for saving cutouts.

      :param idxs: Index or indices to fetch.
      :type idxs: int or slice or list

      :returns: Dictionary with key 'data' containing another dict of default data fields
                to return. Currently only 'image' is supported.
      :rtype: dict


   .. py:method:: download_cutouts(indices=None, sync_filesystem=True, max_workers=None, force_retry=False)

      Download cutouts using multiple threads with caching.

      :param indices: List of indices to download, or None for all
      :param sync_filesystem: Whether to sync manifest with existing files on disk
      :param max_workers: Maximum number of worker threads, or None to use default
      :param force_retry: Whether to retry previously failed downloads
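
      A multi-threaded download loop of this kind might look roughly like the sketch below, which uses
      ``concurrent.futures.ThreadPoolExecutor`` from the standard library. ``download_one`` and
      ``download_all`` are hypothetical helpers; failures on odd indices are simulated to show how
      failed items can be collected for a later retry.

      .. code-block:: python

         from concurrent.futures import ThreadPoolExecutor, as_completed

         def download_one(index):
             """Hypothetical per-cutout download; odd indices fail on purpose."""
             if index % 2:
                 raise RuntimeError(f"download failed for index {index}")
             return f"cutout_{index:04d}.pt"

         def download_all(indices, max_workers=1):
             """Run download_one over indices and split the results by outcome."""
             succeeded, failed = {}, []
             with ThreadPoolExecutor(max_workers=max_workers) as pool:
                 futures = {pool.submit(download_one, i): i for i in indices}
                 for future in as_completed(futures):
                     index = futures[future]
                     try:
                         succeeded[index] = future.result()
                     except RuntimeError:
                         failed.append(index)
             return succeeded, sorted(failed)

         succeeded, failed = download_all(range(4), max_workers=2)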


   .. py:method:: _download_single_cutout(catalog_idx, manifest_idx)

      Helper method to download a single cutout.


   .. py:method:: cache_info()

      Get cache statistics.


   .. py:method:: clear_cache()

      Clear the LRU cache.


   .. py:method:: manifest_stats()

      Get manifest statistics including downloaded bands information.


   .. py:method:: band_filtering_info()

      Get information about current band filtering configuration.


   .. py:method:: save_manifest_now()

      Force immediate manifest save.


   .. py:method:: _determine_numprocs_download()
      :staticmethod:


      Determine number of threads for downloading.


   .. py:method:: reset_failed_downloads()

      Reset failed download attempts to allow retry.


   .. py:method:: download_progress()

      Get detailed download progress information.


   .. py:method:: download_summary()

      Get detailed download and band analysis, accounting for band filtering.


.. py:class:: HSCDataSet(config: dict, data_location=None)

   Bases: :py:obj:`hyrax.data_sets.fits_image_dataset.FitsImageDataSet`


   Dataset for sets of HSC cutouts created by the ``hyrax download`` command.

   .. py:method:: __init__


   .. py:attribute:: _called_from_test
      :value: False


   .. py:attribute:: filters_config


   .. py:method:: _read_filter_catalog(filter_catalog_path: pathlib.Path | None)


   .. py:method:: _parse_filter_catalog(table) -> None

      Sets self.files by parsing the catalog.

      Subclasses may override this function to control parsing of the table more directly, but the
      overriding class must create the files dict which has type dict[object_id -> dict[filter -> filename]]
      with object_id, filter, and filename all strings.  In the case of no filter distinction, a single
      flag value may be used for the filter dict keys in the inner dicts.

      :param table: The catalog we read in
      :type table: Table


   .. py:method:: _set_crop_transform()

      Returns the crop transform on the image

      If overridden, the subclass must:

      1) Set self.cutout_shape to a tuple of ints representing the size of the cutouts that will be
         returned at some point in the init flow.
      2) Update the crop transform using self.set_crop_transform() from the HyraxImageDataset mixin.


   .. py:method:: _before_preload()


   .. py:method:: _scan_file_names(filters: list[str] | None, filter_obj_ids: list[str] | None = None) -> hyrax.data_sets.fits_image_dataset.files_dict

      Class initialization helper

      :param filters: List of filters that we should look for in the data corpus
      :type filters: list[str], Optional
      :param filter_obj_ids: Filter the file scan to only file names which have the provided object IDs, skipping other files
                             When not provided, all file names in the configured data directory that match the pattern from
                             hyrax download are parsed.
      :type filter_obj_ids: list[str], Optional

      :returns: Nested dictionary where the first level maps object_id -> dict, and the second level maps
                filter_name -> file name. Corresponds to self.files
      :rtype: dict[str,dict[str,str]]


   .. py:method:: _determine_numprocs() -> int
      :staticmethod:


   .. py:method:: _fixup_limit(nproc: int, res, est_limit, est_procs) -> int
      :staticmethod:


   .. py:method:: _scan_file_dimensions() -> dim_dict


   .. py:method:: _scan_file_dimension(processing_unit: tuple[str, list[str]]) -> tuple[str, list[tuple[int, int]]]
      :staticmethod:


   .. py:method:: _fits_file_dims(filepath) -> tuple[int, int]
      :staticmethod:


   .. py:method:: _prune_objects(filters_ref: list[str], cutout_shape: tuple[int, int] | None)

      Class initialization helper. Prunes objects from the list of objects.

      1) Removes any objects which do not have all the filters specified in filters_ref
      2) If a cutout_shape was provided in the constructor, prunes files that are too small
         for the chosen cutout size

      This function deletes from self.files and self.dims via _prune_object

      :param files: Nested dictionary where the first level maps object_id -> dict, and the second level maps
                    filter_name -> file name. This is created by _scan_files()
      :type files: dict[str,dict[str,str]]
      :param filters_ref: List of the filter names
      :type filters_ref: list[str]
      :param cutout_shape: Cutout shape tuple provided from constructor
      :type cutout_shape: tuple[int, int]
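
      The two pruning rules can be sketched with plain dicts. This version returns new structures
      instead of deleting in place, and it assumes ``dims`` maps each object ID to a list of
      (width, height) tuples; both choices, and the helper itself, are illustrative assumptions.

      .. code-block:: python

         def prune_objects(files, dims, filters_ref, cutout_shape=None):
             """Return (kept_files, reasons) after applying both pruning rules."""
             kept, reasons = {}, {}
             for object_id, by_filter in files.items():
                 if set(filters_ref) - set(by_filter):
                     # Rule 1: every filter in filters_ref must be present.
                     reasons[object_id] = "missing filter"
                 elif cutout_shape and any(
                     width < cutout_shape[0] or height < cutout_shape[1]
                     for width, height in dims[object_id]
                 ):
                     # Rule 2: every image must be at least as large as the cutout.
                     reasons[object_id] = "too small"
                 else:
                     kept[object_id] = by_filter
             return kept, reasons

         files = {
             "1": {"g": "1_g.fits", "r": "1_r.fits"},
             "2": {"g": "2_g.fits"},
             "3": {"g": "3_g.fits", "r": "3_r.fits"},
         }
         dims = {
             "1": [(100, 100), (101, 100)],
             "2": [(100, 100)],
             "3": [(100, 100), (90, 100)],
         }
         kept, reasons = prune_objects(files, dims, ["g", "r"], cutout_shape=(95, 95))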


   .. py:method:: _mark_for_prune(object_id, reason)


   .. py:method:: _prune_object(object_id, reason: str)


   .. py:method:: _check_file_dimensions() -> tuple[int, int]

      Class initialization helper. Find the maximal pixel size that all images can support

      It is assumed that all the cutouts will be of very similar size; however, HSC's cutout
      server does not return exactly the same number of pixels for every query, even when it
      is given the same angular spread for every cutout.

      Machine learning models expect all images to be the same size.

      This function warns on significant differences (>2px) on any dimension between the largest
      and smallest images.

      :returns: The minimum width and height in pixels of the entire dataset. In other words: the maximal image
                size in pixels that can be generated from ALL cutout images via cropping.
      :rtype: tuple(int,int)
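
      As a sketch of the size check, the function below takes the minimum width and height over all
      images and warns when the spread on either dimension exceeds 2 px. The ``dims`` layout
      (object ID to list of (width, height) tuples) and the function name are assumptions.

      .. code-block:: python

         import warnings

         def check_file_dimensions(dims, tolerance=2):
             """Return the largest (width, height) that every image can be cropped to."""
             all_dims = [dim for per_object in dims.values() for dim in per_object]
             widths = [width for width, _ in all_dims]
             heights = [height for _, height in all_dims]
             if max(widths) - min(widths) > tolerance or max(heights) - min(heights) > tolerance:
                 warnings.warn(f"Cutout sizes differ by more than {tolerance} px")
             return min(widths), min(heights)

         dims = {"1": [(100, 100), (101, 99)], "2": [(100, 101), (100, 100)]}
         cutout_shape = check_file_dimensions(dims)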


   .. py:method:: _rebuild_manifest(config)


   .. py:method:: __contains__(object_id: str) -> bool

      Allows you to do `object_id in dataset` queries. Used by testing code.

      :param object_id: The object ID you'd like to know if is in the dataset
      :type object_id: str

      :returns: True if the object_id given is in the data set
      :rtype: bool


   .. py:method:: _all_files_full()

      Private read-only iterator over all files that enforces a strict total order across
      objects and filters. Will not work prior to self.files, and self.path initialization in __init__

      :Yields: *Tuple[object_id, filter, filename, dim]* -- Members of this tuple are
               - The object_id as a string
               - The filter name as a string
               - The filename relative to self.path
               - A tuple containing the dimensions of the fits file in pixels.


   .. py:method:: _object_files(object_id)

      Private read-only iterator over all files for a given object. This enforces a strict total order
      across filters. Will not work prior to self.files, and self.path initialization in __init__

      Guaranteed to only return files that have filters in self.filters_ref.

      :Yields: *Path* -- The path to the file.


.. py:class:: HyraxCifarDataset(config: dict, data_location: pathlib.Path = None)

   Bases: :py:obj:`HyraxCifarBase`, :py:obj:`hyrax.data_sets.data_set_registry.HyraxDataset`, :py:obj:`torch.utils.data.Dataset`


   Map style CIFAR 10 dataset for Hyrax

   This is simply a version of CIFAR10 that is initialized using Hyrax config with a transformation
   that works well for example code.

   We only use the training split in the data, because it is larger (50k images). Hyrax will then divide that
   into train/test/validate according to configuration.

   .. py:method:: __init__

   Overall initialization for all DataSets which saves the config

   Subclasses of HyraxDataset ought to call this at the end of their __init__ like:

   .. code-block:: python

       from hyrax.data_sets import HyraxDataset
       from torch.utils.data import Dataset

       class MyDataset(HyraxDataset, Dataset):
           def __init__(self, config):
               # <your code>
               super().__init__(config)

   If per tensor metadata is available, it is recommended that dataset authors create an
   astropy Table of that data, in the same order as their data and pass that `metadata_table`
   as shown below:

   .. code-block:: python

       from hyrax.data_sets import HyraxDataset
       from torch.utils.data import Dataset
       from astropy.table import Table

       class MyDataset(HyraxDataset, Dataset):
           def __init__(self, config):
               # <your code>
               metadata_table = Table(<Your catalog data goes here>)
               super().__init__(config, metadata_table)

   :param config: The runtime configuration for hyrax
   :type config: dict, Optional
   :param metadata_table: An Astropy Table with
                          1. the metadata columns desired for visualization AND
                          2. in the order your data will be enumerated.
   :type metadata_table: Optional[Table], optional
   :param object_id_column_name: The name of the column containing object IDs. If None, uses the default
                                 from config or creates one from the ids() method.
   :type object_id_column_name: Optional[str], optional


   .. py:method:: __len__()


   .. py:method:: __getitem__(idx)


.. py:class:: HyraxCifarIterableDataset(config: dict, data_location: pathlib.Path = None)

   Bases: :py:obj:`HyraxCifarBase`, :py:obj:`hyrax.data_sets.data_set_registry.HyraxDataset`, :py:obj:`torch.utils.data.IterableDataset`


   Iterable style CIFAR 10 dataset for Hyrax

   This is simply a version of CIFAR10 that is initialized using Hyrax config with a transformation
   that works well for example code. This version only supports iteration, and not map-style access

   We only use the training split in the data, because it is larger (50k images). Hyrax will then divide that
   into Train/test/Validate according to configuration.

   .. py:method:: __init__

   Overall initialization for all DataSets which saves the config

   Subclasses of HyraxDataset ought to call this at the end of their __init__ like:

   .. code-block:: python

       from hyrax.data_sets import HyraxDataset
       from torch.utils.data import Dataset

       class MyDataset(HyraxDataset, Dataset):
           def __init__(self, config):
               # <your code>
               super().__init__(config)

   If per tensor metadata is available, it is recommended that dataset authors create an
   astropy Table of that data, in the same order as their data and pass that `metadata_table`
   as shown below:

   .. code-block:: python

       from hyrax.data_sets import HyraxDataset
       from torch.utils.data import Dataset
       from astropy.table import Table

       class MyDataset(HyraxDataset, Dataset):
           def __init__(self, config):
               # <your code>
               metadata_table = Table(<Your catalog data goes here>)
               super().__init__(config, metadata_table)

   :param config: The runtime configuration for hyrax
   :type config: dict, Optional
   :param metadata_table: An Astropy Table with
                          1. the metadata columns desired for visualization AND
                          2. in the order your data will be enumerated.
   :type metadata_table: Optional[Table], optional
   :param object_id_column_name: The name of the column containing object IDs. If None, uses the default
                                 from config or creates one from the ids() method.
   :type object_id_column_name: Optional[str], optional


   .. py:method:: __iter__()


.. py:class:: HyraxRandomDataset(config, data_location)

   Bases: :py:obj:`HyraxRandomDatasetBase`, :py:obj:`hyrax.data_sets.data_set_registry.HyraxDataset`, :py:obj:`torch.utils.data.Dataset`


   This dataset is a stand-in for a map-style dataset.
   It will produce random numpy arrays along with sequential numeric ids and,
   optionally, labels randomly selected from the provided list of possible labels.

   .. py:method:: __init__(config, data_location)

   Initialize the dataset using the parameters defined in the configuration.

   The ``data_location`` parameter is included for API consistency with other dataset classes, though
   it is not used by this implementation. All parameters are controlled by the following
   keys under the ``["data_set"]["HyraxRandomDataset"]`` table in the configuration:

   - ``size``: The number of random data samples to produce.
   - ``shape``: The shape of each random data sample as a tuple (e.g. (3, 29, 29) = 3
     layers of 2D data, each layer is 29x29 elements).
   - ``seed``: The random seed to use for reproducibility.
   - ``provided_labels``: A list of possible labels to randomly select from.
     If this is provided, the dataset will randomly select a label for each data sample.
   - ``metadata_fields``: A list of metadata field names. Used to create a metadata
     table with columns corresponding to each field name. All data is numeric.
   - ``number_invalid_values``: The number of invalid values to insert into the data.
   - ``invalid_value_type``: The type of invalid value to insert into the data.
     Valid values are "nan", "inf", "-inf", "none", or a float value.
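
   As an illustration, the keys above could be set as follows. The values are arbitrary examples, and a
   plain dictionary stands in for the Hyrax configuration object:

   .. code-block:: python

      config = {"data_set": {"HyraxRandomDataset": {}}}
      random_cfg = config["data_set"]["HyraxRandomDataset"]

      random_cfg["size"] = 100                        # number of samples
      random_cfg["shape"] = (3, 29, 29)               # 3 layers of 29x29 elements
      random_cfg["seed"] = 42                         # for reproducibility
      random_cfg["provided_labels"] = ["cat", "dog"]  # labels to choose from
      random_cfg["metadata_fields"] = ["ra", "dec"]   # numeric metadata columns
      random_cfg["number_invalid_values"] = 0         # no invalid values inserted
      random_cfg["invalid_value_type"] = "nan"        # type used if any are inserted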


   .. py:method:: __getitem__(idx: int) -> dict

      Get a data sample by index.

      The returned dictionary will contain the following keys:

      - ``index``: The index of the data sample.
      - ``object_id``: The ID of the data sample.
      - ``image``: The data sample as a numpy array.
      - ``label``: The label of the data sample (if provided).


      :param idx: The index of the data sample to retrieve.
      :type idx: int

      :returns: A dictionary containing the data sample and its metadata.
      :rtype: dict


   .. py:method:: __len__()

      Get the total number of samples in this dataset. This should return
      the same value as the `size` parameter in the configuration.


   .. py:method:: ids()

      This function yields IDs for the dataset. It can be used as an iterable
      in a loop, or converted to a list by wrapping the function call in ``list(...)``.
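
      A minimal sketch of both usage patterns, with a hypothetical ``FakeDataset`` standing in for the
      real class:

      .. code-block:: python

         class FakeDataset:
             def __init__(self, size: int):
                 self.id_list = list(range(size))

             def ids(self):
                 # A generator: IDs are produced lazily, one per iteration.
                 for object_id in self.id_list:
                     yield str(object_id)


         dataset = FakeDataset(3)
         for object_id in dataset.ids():   # use directly in a loop
             print(object_id)
         all_ids = list(dataset.ids())     # or materialize as a list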


.. py:class:: HyraxRandomIterableDataset(config, data_location)

   Bases: :py:obj:`HyraxRandomDatasetBase`, :py:obj:`hyrax.data_sets.data_set_registry.HyraxDataset`, :py:obj:`torch.utils.data.IterableDataset`


   This dataset is a stand-in for an iterable-style, or streaming, dataset.
   It will produce random numpy arrays and, optionally, labels randomly
   selected from the provided list of possible labels.

   .. note::

       While ids will be generated automatically for this dataset, calling the
       ``ids`` method of this dataset will return the index instead of the id.

   .. py:method:: __init__(config, data_location)

   Initialize the dataset using the parameters defined in the configuration.

   The ``data_location`` parameter is included for API consistency with other dataset classes, though
   it is not used by this implementation. All parameters are controlled by the following
   keys under the ``["data_set"]["HyraxRandomDataset"]`` table in the configuration:

   - ``size``: The number of random data samples to produce.
   - ``shape``: The shape of each random data sample as a tuple (e.g. (3, 29, 29) = 3
     layers of 2D data, each layer is 29x29 elements).
   - ``seed``: The random seed to use for reproducibility.
   - ``provided_labels``: A list of possible labels to randomly select from.
     If this is provided, the dataset will randomly select a label for each data sample.
   - ``metadata_fields``: A list of metadata field names. Used to create a metadata
     table with columns corresponding to each field name. All data is numeric.
   - ``number_invalid_values``: The number of invalid values to insert into the data.
   - ``invalid_value_type``: The type of invalid value to insert into the data.
     Valid values are "nan", "inf", "-inf", "none", or a float value.


   .. py:method:: __iter__()

      Yield the next data sample. The returned dictionary will have the
      following form:

      - ``data``: A dictionary containing:

        - ``index``: The index of the data sample.
        - ``object_id``: The value will be the same as ``index`` for this dataset.
        - ``image``: The data sample as a numpy array.
        - ``label``: The label of the data sample (if provided).

      :returns: A dictionary containing a data sample and its metadata.
      :rtype: dict


.. py:class:: HyraxRandomDatasetBase(config, data_location)

   This is the base class for the random datasets provided by Hyrax.

   .. warning::

       Direct use of ``HyraxRandomDatasetBase`` is not advised. When working
       with Hyrax, prefer to use ``HyraxRandomDataset`` or ``HyraxRandomIterableDataset``.

   .. py:method:: __init__(config, data_location)

   Initialize the dataset using the parameters defined in the configuration.

   The ``data_location`` parameter is included for API consistency with other dataset classes, though
   it is not used by this implementation. All parameters are controlled by the following
   keys under the ``["data_set"]["HyraxRandomDataset"]`` table in the configuration:

   - ``size``: The number of random data samples to produce.
   - ``shape``: The shape of each random data sample as a tuple (e.g. (3, 29, 29) = 3
     layers of 2D data, each layer is 29x29 elements).
   - ``seed``: The random seed to use for reproducibility.
   - ``provided_labels``: A list of possible labels to randomly select from.
     If this is provided, the dataset will randomly select a label for each data sample.
   - ``metadata_fields``: A list of metadata field names. Used to create a metadata
     table with columns corresponding to each field name. All data is numeric.
   - ``number_invalid_values``: The number of invalid values to insert into the data.
   - ``invalid_value_type``: The type of invalid value to insert into the data.
     Valid values are "nan", "inf", "-inf", "none", or a float value.


   .. py:attribute:: data
      :type:  numpy.ndarray

      The random data samples produced by the dataset.


   .. py:attribute:: id_list
      :type:  list

      A list of sequential numeric IDs for each data sample.


   .. py:attribute:: provided_labels
      :type:  list

      A list of labels randomly selected from the provided list of possible labels.


   .. py:attribute:: data_location


   .. py:method:: get_image(idx: int) -> numpy.ndarray

      Get the image at the given index as a NumPy array.


   .. py:method:: get_label(idx: int) -> str

      Get the label at the given index.


   .. py:method:: get_object_id(idx: int) -> str

      Get the object ID of the item at the given index.


.. py:class:: InferenceDataSet(config, results_dir: Union[pathlib.Path, str] | None = None, verb: str | None = None)

   Bases: :py:obj:`hyrax.data_sets.data_set_registry.HyraxDataset`, :py:obj:`torch.utils.data.Dataset`


   This dataset class represents situations where we wish to treat the output of inference
   as a dataset, e.g. when performing umap/visualization operations.

   Initialize an InferenceDataSet object.

   As a user of this code, you should almost never create this class directly. Instances of this class are
   returned by the umap and infer verbs; prefer those over creating your own.

   If you do end up creating your own instance, you will need a hyrax config and some knowledge
   of where the result you are interested in is stored.

   :param config: The hyrax config dictionary
   :type config: dict
   :param results_dir: The results subdirectory of the inference or umap results you want to access, by default None.
                       If no results subdirectory is provided, this function will attempt the following in order:

                       #. Use the directory specified in ``config['results']['inference_dir']`` if set and the directory
                          exists
                       #. Look in the results configured in ``config['general']['results_dir']`` (``./results/``
                          by default), then use the most recent results directory corresponding to the verb specified.
   :type results_dir: Optional[Union[Path, str]], optional
   :param verb: The name of the verb that generated the results, only important when the most recent results
                are being fetched. If no verb is provided, "infer" will be assumed.
   :type verb: Optional[str], optional

   :raises RuntimeError: When the provided results directory is corrupt, or cannot be found.
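
   The lookup order above can be sketched roughly as follows. This is an illustration only: the function
   name is hypothetical, and matching result directories by a ``*-<verb>-*`` name pattern and picking the
   lexicographically last one are assumptions, not the documented behavior:

   .. code-block:: python

      from pathlib import Path


      def resolve_results_dir(config: dict, results_dir=None, verb=None) -> Path:
          if results_dir is not None:
              return Path(results_dir)  # an explicit override wins

          # 1. A configured inference directory, if it is set and exists.
          configured = config["results"].get("inference_dir")
          if configured and Path(configured).is_dir():
              return Path(configured)

          # 2. The most recent directory for the verb under the general results directory.
          verb = verb or "infer"
          root = Path(config["general"]["results_dir"])
          candidates = sorted(p for p in root.glob(f"*-{verb}-*") if p.is_dir())
          if not candidates:
              raise RuntimeError(f"No results found for verb '{verb}' in {root}")
          return candidates[-1]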


   .. py:attribute:: results_dir


   .. py:attribute:: batch_index


   .. py:attribute:: length


   .. py:attribute:: cached_batch_num
      :type:  int | None
      :value: None


   .. py:attribute:: shape_element


   .. py:attribute:: _original_dataset_config


   .. py:attribute:: original_dataset


   .. py:method:: _shape()

      The shape of the dataset (Discovered from files)

      :returns: Tuple with the shape of an individual element of the dataset
      :rtype: Tuple


   .. py:method:: ids() -> collections.abc.Generator[str]

      IDs of this dataset. Will return a string generator with IDs.

      These IDs are the IDs of the dataset used originally to generate this dataset.

      :returns: Generator that yields the string ids of this dataset
      :rtype: Generator[str]

      :Yields: *Generator[str]* -- Yields the string ids of this dataset


   .. py:method:: __getitem__(idx: Union[int, numpy.ndarray])

      Implements the ``[]`` operator

      :param idx: Either an index or a numpy array of indexes.
                  These are NOT the ID values of the dataset, but rather a zero-based index starting
                  at the beginning of the inference dataset.
      :type idx: Union[int, np.ndarray]

      :returns: Either the tensor corresponding to a single result, or a tensor with a multiplicity of
                results if multiple indexes were passed.
      :rtype: torch.tensor


   .. py:method:: __len__() -> int

      Returns the length of the dataset.

      :returns: Length of the dataset.
      :rtype: int


   .. py:property:: original_config
      :type: dict


      Get the original configuration for the dataset used to generate this inference dataset

      Since this sort of dataset is definitionally an intermediate product, this returns the
      runtime config used to construct that dataset rather than this one.

      :returns: Configuration that can be used to create the original dataset that was used
                as input for whatever inference process created this dataset.
      :rtype: dict


   .. py:method:: metadata_fields() -> list[str]

      Get the metadata fields associated with the original dataset used to generate this one

      :returns: List of valid field names for metadata queries
      :rtype: list[str]


   .. py:method:: metadata(idxs: numpy.typing.ArrayLike, fields: list[str]) -> numpy.typing.ArrayLike

      Get metadata associated with the data in the InferenceDataSet. This metadata comes from
      the original dataset, but is indexed according to the InferenceDataSet.

      :param idxs: Indexes in the InferenceDataSet for which metadata is desired
      :type idxs: npt.ArrayLike
      :param fields: Metadata fields requested
      :type fields: list[str]

      :returns: An array where the rows correspond to the passed list of indexes and the columns
                correspond to the fields passed. Order is preserved: metadata[i] corresponds to idxs[i].
      :rtype: npt.ArrayLike
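
      The indexing contract can be illustrated with plain Python lists. The table contents and the
      ``select_metadata`` helper are made up for this sketch; the real method returns a NumPy array:

      .. code-block:: python

         table = [
             {"object_id": "a", "ra": 10.0, "dec": -1.0},
             {"object_id": "b", "ra": 20.0, "dec": -2.0},
             {"object_id": "c", "ra": 30.0, "dec": -3.0},
         ]


         def select_metadata(idxs: list[int], fields: list[str]) -> list[tuple]:
             # One row per requested index, one column per requested field.
             return [tuple(table[i][f] for f in fields) for i in idxs]


         # rows[i] corresponds to idxs[i], in the order the indexes were passed.
         rows = select_metadata([2, 0], ["ra", "dec"])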


   .. py:method:: _load_from_batch_file(batch_num: int, ids=Union[int, np.ndarray]) -> numpy.ndarray

      Hands back an array of tensors given a set of IDs in a particular batch and the given
      batch number


   .. py:method:: _resolve_results_dir(config, results_dir: Union[pathlib.Path, str] | None, verb: str | None) -> pathlib.Path

      Initialize an inference results directory as a data source. Accepts an override of what
      directory to use


.. py:class:: HyraxDataset(config: dict, metadata_table=None, object_id_column_name=None)

   How to make a hyrax dataset:

   .. code-block:: python

       from hyrax.data_sets import HyraxDataset
       from torch.utils.data import Dataset

       class MyDataset(HyraxDataset, Dataset):
           def __init__(self, config: dict):
               super().__init__(config)

           def __getitem__(self, idx):
               # Your getitem goes here
               pass

           def __len__(self):
               # Your len function goes here
               pass

   Optional interfaces:

   ``ids()`` -> Subclasses may override this directly with their own ids function
   returning a generator of strings

   ``metadata`` -> Subclasses may pass an astropy table of metadata to ``__init__`` in the
   superclass. This table of metadata will be available through the ``metadata_fields`` and
   ``metadata`` functions.  If desired, a subclass may override these functions directly
   rather than using the astropy Table interface.

   Further documentation is in the :doc:`/pre_executed/custom_dataset` example notebook.


   .. py:method:: __init__

   Overall initialization for all DataSets which saves the config

   Subclasses of HyraxDataset ought to call this at the end of their __init__ like:

   .. code-block:: python

       from hyrax.data_sets import HyraxDataset
       from torch.utils.data import Dataset

       class MyDataset(HyraxDataset, Dataset):
           def __init__(self, config):
               # <your code>
               super().__init__(config)

   If per tensor metadata is available, it is recommended that dataset authors create an
   astropy Table of that data, in the same order as their data and pass that `metadata_table`
   as shown below:

   .. code-block:: python

       from hyrax.data_sets import HyraxDataset
       from torch.utils.data import Dataset
       from astropy.table import Table

       class MyDataset(HyraxDataset, Dataset):
           def __init__(self, config):
               # <your code>
               metadata_table = Table(<Your catalog data goes here>)
               super().__init__(config, metadata_table)

   :param config: The runtime configuration for hyrax
   :type config: dict, Optional
   :param metadata_table: An Astropy Table with
                          1. the metadata columns desired for visualization AND
                          2. in the order your data will be enumerated.
   :type metadata_table: Optional[Table], optional
   :param object_id_column_name: The name of the column containing object IDs. If None, uses the default
                                 from config or creates one from the ids() method.
   :type object_id_column_name: Optional[str], optional


   .. py:attribute:: _config


   .. py:attribute:: _metadata_table
      :value: None


   .. py:attribute:: tensorboardx_logger
      :value: None


   .. py:method:: is_iterable()
      :classmethod:


      Returns True if the underlying dataset is iterable style (supporting __iter__), as opposed to map
      style, where __getitem__/__len__ are the preferred access methods.

      :returns: True if underlying dataset is iterable
      :rtype: bool


   .. py:method:: is_map()
      :classmethod:


      Returns True if the underlying dataset is map style (supporting __getitem__/__len__), as opposed to
      iterable style, where __iter__ is the preferred access method.

      :returns: True if underlying dataset is map-style
      :rtype: bool


   .. py:property:: config


   .. py:method:: __init_subclass__()
      :classmethod:


   .. py:method:: ids() -> collections.abc.Generator[str]

      This is the default IDs function you get when you derive from hyrax Dataset

      :returns: A generator yielding all the string IDs of the dataset.
      :rtype: Generator[str]


   .. py:method:: sample_data() -> dict

      Get a sample from the dataset. This is a convenience function that returns
      the first sample from the dataset, regardless of whether it is iterable
      or map-style. Often this will be used to instantiate a model that adjusts
      its form based on the shape of the data.
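
      Conceptually this amounts to the following sketch. The free function and the explicit
      ``is_iterable`` flag are simplifications made for illustration:

      .. code-block:: python

         def first_sample(dataset, is_iterable: bool):
             if is_iterable:
                 # Iterable-style: take the first item the iterator yields.
                 return next(iter(dataset))
             # Map-style: index the first element.
             return dataset[0]


         map_style = [{"image": [0.1, 0.2]}, {"image": [0.3, 0.4]}]
         iterable_style = ({"image": [i]} for i in range(3))

         sample_a = first_sample(map_style, is_iterable=False)
         sample_b = first_sample(iterable_style, is_iterable=True)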


   .. py:method:: metadata_fields() -> list[str]

      Returns a list of metadata fields supported by this object

      :returns: The column names of the metadata table passed. Empty string if no metadata was provided
                during construction of the HyraxDataset (or derived class).
      :rtype: list[str]


   .. py:method:: metadata(idxs: numpy.typing.ArrayLike, fields: list[str]) -> numpy.typing.ArrayLike

      Returns a table representing the metadata given an array of indexes and a list of fields.

      :param idxs: The indexes of the relevant tensor objects
      :type idxs: npt.ArrayLike
      :param fields: The names of the fields you would like returned. All values must be among those returned by
                     metadata_fields()
      :type fields: list[str]

      :returns: A numpy record array of your metadata, with only the columns specified.
                Roughly equivalent to: `metadata_table[idxs][fields].as_array()` where metadata_table is the
                astropy table that the HyraxDataset (or derived class) was constructed with.
      :rtype: npt.ArrayLike

      :raises RuntimeError: When none of the provided fields are


.. py:function:: iterable_dataset_collate(batch: list[dict]) -> dict

   Collate function used for iterable datasets, since they do not work with the DataProvider's default collate.

   Enable with h.config["data_loader"]["collate_fn"] = "hyrax.data_sets.iterable_dataset_collate"

   :param batch: The batch of data dictionaries returned from the iterable dataset
   :type batch: list[dict]

   :returns: Dict where each non-dict value is a np.array of items, ready for further hyrax processing.
   :rtype: dict

   :raises RuntimeError: If internal dictionary logic fails. This usually means an error in the structure of the input
       dictionary.
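
   A rough sketch of what such a collate step does, using plain lists in place of ``np.array`` so the
   example has no dependencies. The function name ``collate_sketch`` is hypothetical and the real
   implementation may differ:

   .. code-block:: python

      def collate_sketch(batch: list[dict]) -> dict:
          """Turn a list of (possibly nested) dicts into one dict of lists."""
          collated = {}
          for key, first_value in batch[0].items():
              values = [sample[key] for sample in batch]
              if isinstance(first_value, dict):
                  # Recurse so nested dictionaries keep their structure.
                  collated[key] = collate_sketch(values)
              else:
                  collated[key] = values
          return collated


      batch = [
          {"object_id": "0", "data": {"image": [1, 2], "label": "a"}},
          {"object_id": "1", "data": {"image": [3, 4], "label": "b"}},
      ]
      result = collate_sketch(batch)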


.. py:class:: HyraxCifarBase(config: dict, data_location: pathlib.Path = None)

   Base class for Hyrax Cifar datasets


   .. py:attribute:: data_location


   .. py:attribute:: training_data


   .. py:attribute:: cifar


   .. py:attribute:: id_width
      :value: 0


   .. py:method:: get_image(idx)

      Get the image at the given index as a NumPy array.


   .. py:method:: get_label(idx)

      Get the label at the given index.


   .. py:method:: get_index(idx)

      Get the index of the item.


   .. py:method:: get_object_id(idx)

      Get the object ID for the item.


   .. py:method:: ids()

      This is the default IDs function you get when you derive from hyrax Dataset

      :returns: A generator yielding all the string IDs of the dataset.
      :rtype: Generator[str]


.. py:class:: HyraxCSVDataset(config: dict, data_location: pathlib.Path = None)

   Bases: :py:obj:`hyrax.data_sets.data_set_registry.HyraxDataset`


   A Hyrax Dataset for CSV files.

   This class reads a CSV file using pandas with memory mapping enabled.
   It dynamically creates getter methods for each column in the CSV file,
   allowing users to request data from specific columns.

   .. note::

      Column names found in the CSV file are used to create the getter methods.
      If a column name contains characters that are invalid for method names,
      those characters are replaced with underscores.
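
   The sketch below illustrates the general technique of generating one getter per column. It uses the
   standard ``csv`` module instead of pandas, and the ``get_<column>`` naming is an assumption made for
   this example rather than a statement about the Hyrax API:

   .. code-block:: python

      import csv
      import io
      import re


      class CsvSketch:
          def __init__(self, text: str):
              self.rows = list(csv.DictReader(io.StringIO(text)))
              for column in self.rows[0]:
                  # Replace characters that are not valid in identifiers.
                  safe = re.sub(r"\W", "_", column)
                  setattr(self, f"get_{safe}", self._make_getter(column))

          def _make_getter(self, column: str):
              # Bind `column` now so each getter reads its own column.
              return lambda idx: self.rows[idx][column]


      data = CsvSketch("object id,flux-g\n7,1.5\n8,2.5\n")
      value = data.get_flux_g(1)  # the csv module returns values as strings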

   .. rubric:: Examples

   Example model_inputs configuration::

       {
           "train": {
               "data": {
                   "dataset_class": "HyraxCSVDataset",
                   "data_location": "</path/to/data.csv>",
                   "fields": ["<column1>", "<column2>", ...],
                   "primary_id_field": "<column name that contains a unique ID>",
               },
           },
           "validate": { "<similar to above>" },
           "infer": { "<similar to above>" },
       }

   .. py:method:: __init__

   Overall initialization for all DataSets which saves the config

   Subclasses of HyraxDataset ought to call this at the end of their __init__ like:

   .. code-block:: python

       from hyrax.data_sets import HyraxDataset
       from torch.utils.data import Dataset

       class MyDataset(HyraxDataset, Dataset):
           def __init__(self, config):
               # <your code>
               super().__init__(config)

   If per tensor metadata is available, it is recommended that dataset authors create an
   astropy Table of that data, in the same order as their data and pass that `metadata_table`
   as shown below:

   .. code-block:: python

       from hyrax.data_sets import HyraxDataset
       from torch.utils.data import Dataset
       from astropy.table import Table

       class MyDataset(HyraxDataset, Dataset):
           def __init__(self, config):
               # <your code>
               metadata_table = Table(<Your catalog data goes here>)
               super().__init__(config, metadata_table)

   :param config: The runtime configuration for hyrax
   :type config: dict, Optional
   :param metadata_table: An Astropy Table with
                          1. the metadata columns desired for visualization AND
                          2. in the order your data will be enumerated.
   :type metadata_table: Optional[Table], optional
   :param object_id_column_name: The name of the column containing object IDs. If None, uses the default
                                 from config or creates one from the ids() method.
   :type object_id_column_name: Optional[str], optional


   .. py:attribute:: data_location
      :value: None


   .. py:attribute:: column_names


   .. py:attribute:: mem_mapped_csv
      :value: None


   .. py:method:: __getitem__(idx)

      Currently required by Hyrax machinery, but likely to be phased out.


   .. py:method:: __len__() -> int

      Return the number of records in the CSV.


   .. py:method:: sample_data()

      Return the first record, in dictionary form, as the sample.


   .. py:method:: is_map() -> bool
      :classmethod:


      Boilerplate method to indicate this is a map-style dataset.