hyrax.data_sets.fits_image_dataset
==================================

.. py:module:: hyrax.data_sets.fits_image_dataset

.. autoapi-nested-parse::

   ``FitsImageDataSet`` is for image data stored in a single directory and described by a tabular catalog file.

   At minimum, your tabular catalog **must** contain the following:

   #. A unique ID column for each astronomical object you are interested in
   #. A filename column containing the filename of the FITS image file.
   #. If you have multiple images with the same object ID, each image must have its own row in the catalog, and there must be a column naming the telescope filter that differentiates these images.
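
   As an illustration of these requirements, the sketch below writes a small CSV catalog using the
   default column names. The object IDs, filter names, and filenames are made up for the example,
   and CSV is only one of the formats you might choose.

   .. code-block:: python

       import csv
       import io

       # Hypothetical rows: one row per image. Object "1001" has two images,
       # distinguished by the filter column.
       rows = [
           {"object_id": "1001", "filter": "g", "filename": "1001_g.fits"},
           {"object_id": "1001", "filter": "r", "filename": "1001_r.fits"},
           {"object_id": "1002", "filter": "g", "filename": "1002_g.fits"},
       ]

       # Write to an in-memory buffer; a real catalog would be written to a file.
       buffer = io.StringIO()
       writer = csv.DictWriter(buffer, fieldnames=["object_id", "filter", "filename"])
       writer.writeheader()
       writer.writerows(rows)
       catalog_text = buffer.getvalue()
       print(catalog_text)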

   We recommend that all your FITS images be roughly the same size.

   Setting up hyrax to use ``FitsImageDataSet`` in a notebook works as follows. The same configuration options
   can go in a configuration file if you are running from the CLI.

   .. code-block:: python

       import hyrax
       h = hyrax.Hyrax()
       h.config["data_set"]["name"] = "FitsImageDataSet"
       h.config["general"]["data_dir"] = "/file/path/to/where/your/fits/files/are"

       # Location of your catalog file. Any file format supported by astropy.table.Table will work.
       h.config["data_set"]["filter_catalog"] = "/file/path/to/your/catalog.fits"

       # Size in pixels to send to the ML model. All images must be this size or larger in
       # both dimensions.
       h.config["data_set"]["crop_to"] = (100, 100)

       # Constructing the dataset is a good first check of your configuration. Once this works,
       # you might try to train or infer.
       dataset = h.prepare()

   This is the minimal setup that can work; however, there are several other configuration options you may need
   to set depending on your usage.

   The column names for the required columns are configurable. By default we use ``object_id``, ``filter``, and
   ``filename``; however, by setting ``h.config["data_set"]["object_id_column_name"]`` you can set the correct
   name for your catalog file. ``h.config["data_set"]["filter_column_name"]`` and
   ``h.config["data_set"]["filename_column_name"]`` work in a corresponding manner.
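
   For example, a catalog whose columns are named differently could be configured as sketched below.
   The configuration keys are the ones described above; the column names ``source_id``, ``band``, and
   ``cutout_file`` are hypothetical.

   .. code-block:: python

       import hyrax

       h = hyrax.Hyrax()

       # Hypothetical column names; substitute the ones used in your catalog.
       h.config["data_set"]["object_id_column_name"] = "source_id"
       h.config["data_set"]["filter_column_name"] = "band"
       h.config["data_set"]["filename_column_name"] = "cutout_file"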

   If your dataset does not fit in memory on your system, we recommend setting
   ``h.config["data_set"]["use_cache"]`` and ``h.config["data_set"]["preload_cache"]`` to ``False``.
   Both are ``True`` by default. The former caches all tensors read during an epoch into system RAM, with the
   intent of speeding up later epochs of training if your disk has low bandwidth. The latter begins this process
   of caching all tensors into system RAM in a background thread as soon as the ``FitsImageDataSet`` is
   constructed, front-running the ``train`` or ``infer`` verb requesting tensors. The intent of this optimization
   is to speed up the first epoch of training in the case where your disk has high latency. Both will result in
   crashes if there is not enough room in your system RAM for the entire dataset.
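
   A minimal sketch of turning both options off for a dataset larger than system RAM, using only the
   configuration keys named above:

   .. code-block:: python

       import hyrax

       h = hyrax.Hyrax()

       # Do not keep tensors read during an epoch in system RAM.
       h.config["data_set"]["use_cache"] = False
       # Do not start the background preload when the dataset is constructed.
       h.config["data_set"]["preload_cache"] = False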

   If you need to truncate your dataset to fit in RAM, the easiest way is to select a small number of rows
   from your original catalog file. FitsImageDataSet will only attempt to load images that exist in the catalog.
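
   One way to make such a truncated catalog is sketched below. It assumes the catalog is a CSV file
   and uses in-memory buffers in place of real files; ``truncate_catalog`` is a hypothetical helper,
   not part of hyrax.

   .. code-block:: python

       import csv
       import io
       from itertools import islice


       def truncate_catalog(source, dest, max_rows):
           """Copy the header and at most max_rows data rows from source to dest."""
           reader = csv.reader(source)
           writer = csv.writer(dest, lineterminator="\n")
           writer.writerow(next(reader))  # header row
           writer.writerows(islice(reader, max_rows))


       # In-memory stand-ins for the original and truncated catalog files.
       full = io.StringIO(
           "object_id,filter,filename\n"
           "1,g,1_g.fits\n"
           "2,g,2_g.fits\n"
           "3,g,3_g.fits\n"
       )
       small = io.StringIO()
       truncate_catalog(full, small, max_rows=2)
       print(small.getvalue())

   If objects have several rows each (one per filter), choose the cut point so that no object is left
   with only some of its images.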


Attributes
----------

.. autoapisummary::

   hyrax.data_sets.fits_image_dataset.logger
   hyrax.data_sets.fits_image_dataset.files_dict


Classes
-------

.. autoapisummary::

   hyrax.data_sets.fits_image_dataset.FitsImageDataSet


Module Contents
---------------

.. py:data:: logger

.. py:data:: files_dict

.. py:class:: FitsImageDataSet(config: dict, data_location=None)

   Bases: :py:obj:`hyrax.data_sets.data_set_registry.HyraxDataset`, :py:obj:`hyrax.data_sets.data_set_registry.HyraxImageDataset`, :py:obj:`hyrax.data_sets.tensor_cache_mixin.TensorCacheMixin`, :py:obj:`torch.utils.data.Dataset`


   Dataset for FITS images, typically cutouts.

   .. py:method:: __init__

      Initialize a FitsImageDataSet

      Most work is done in ``_init_from_path`` and functions it calls in order to allow
      subclasses to override behavior.

      :param config: Nested configuration dictionary for hyrax
      :type config: dict
      :param data_location: The directory location of the data that this dataset class will access
      :type data_location: Optional[Union[Path, str]]


   .. py:attribute:: _called_from_test
      :value: False


   .. py:attribute:: _config


   .. py:attribute:: object_id_column_name


   .. py:attribute:: filter_column_name


   .. py:attribute:: filename_column_name


   .. py:method:: _init_from_path(path: Union[pathlib.Path, str])

      ``__init__`` helper. Initialize an HSC data set from a path. This involves several filesystem scan
      operations and will ultimately open and read the header info of every FITS file in the given directory.

      :param path: Path or string specifying the directory path that is the root of all filenames in the
                   catalog table
      :type path: Union[Path, str]


   .. py:method:: _set_crop_transform()

      Returns the crop transform on the image.

      If overridden, the subclass must:

      1) At some point in the init flow, set ``self.cutout_shape`` to a tuple of ints representing the
         size of the cutouts that will be returned.
      2) Update the crop transform using ``self.set_crop_transform()`` from the ``HyraxImageDataset`` mixin.


   .. py:method:: _read_filter_catalog(filter_catalog_path: pathlib.Path | None)


   .. py:method:: _parse_filter_catalog(table) -> None

      Sets self.files by parsing the catalog.

      Subclasses may override this function to control parsing of the table more directly, but the
      overriding class must create the files dict which has type dict[object_id -> dict[filter -> filename]]
      with object_id, filter, and filename all strings.  In the case of no filter distinction, a single
      flag value may be used for the filter dict keys in the inner dicts.

      :param table: The catalog we read in
      :type table: Table
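
      The nested structure described above can be sketched as follows. This is an illustration of
      the expected ``files`` dict shape only, not the library's parsing code; ``build_files_dict``
      and the row values are hypothetical.

      .. code-block:: python

          def build_files_dict(rows):
              """Group (object_id, filter, filename) rows into a nested dict of strings."""
              files = {}
              for object_id, filter_name, filename in rows:
                  # Outer key: object ID. Inner key: filter. Value: filename.
                  files.setdefault(str(object_id), {})[str(filter_name)] = str(filename)
              return files


          rows = [
              ("1001", "g", "1001_g.fits"),
              ("1001", "r", "1001_r.fits"),
              ("1002", "g", "1002_g.fits"),
          ]
          files = build_files_dict(rows)
          print(files["1001"]["r"])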


   .. py:method:: _before_preload() -> None


   .. py:method:: _prepare_metadata()


   .. py:method:: shape() -> tuple[int, int, int]

      Shape of the individual cutouts this will give to a model

      :returns: Tuple describing the dimensions of the 3-dimensional tensor handed back to models.
                The first index is the number of filters, the second is the width of each image,
                and the third is the height of each image.
      :rtype: tuple[int,int,int]


   .. py:method:: __len__() -> int

      Returns number of objects in this loader

      :returns: number of objects in this data loader
      :rtype: int


   .. py:method:: get_object_id(idx: int) -> str

      Get the object ID at the given index

      :param idx: Index of the object ID to return
      :type idx: int

      :returns: The object ID at the given index
      :rtype: str
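
      As a usage sketch, the accessors documented above (``__len__``, ``shape``, and
      ``get_object_id``) can be combined to inspect a constructed dataset. ``summarize`` is a
      hypothetical helper, and ``dataset`` is assumed to be any object exposing those three methods.

      .. code-block:: python

          def summarize(dataset, max_items=3):
              """Collect basic facts about a dataset via its public accessors."""
              count = min(len(dataset), max_items)
              return {
                  "length": len(dataset),
                  "shape": dataset.shape(),
                  # Object IDs are looked up by zero-based integer index.
                  "first_ids": [dataset.get_object_id(i) for i in range(count)],
              }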


   .. py:method:: get_image(idx: int)

      Get the image at the given index as a PyTorch Tensor.

      :param idx: Index of the image to return
      :type idx: int

      :returns: The image at the given index as a PyTorch Tensor.
      :rtype: torch.Tensor


   .. py:method:: __getitem__(idx: int)


   .. py:method:: __contains__(object_id: str) -> bool

      Allows you to do ``object_id in dataset`` queries. Used by testing code.

      :param object_id: The object ID to look up in the dataset
      :type object_id: str

      :returns: True if the given object_id is in the data set
      :rtype: bool


   .. py:method:: _get_file(index: int) -> pathlib.Path

      Private indexing method across all files.

      Returns the file path corresponding to the given index.

      The index is zero-based and defined in the same manner as the total order of _all_files() and
      _object_files() iterator. Useful if you have an np.array() or list built from _all_files() and you
      need to select an individual item.

      Only valid after ``self.object_ids``, ``self.files``, ``self.path``, and ``self.num_filters`` have
      been initialized in ``__init__``.

      :param index: Index, see above for order semantics
      :type index: int

      :returns: The path to the file
      :rtype: Path


   .. py:method:: ids(log_every=None) -> collections.abc.Generator[str]

      Public read-only iterator over all object_ids that enforces a strict total order across
      objects. Will not work prior to self.files initialization in __init__

      :Yields: *Iterator[str]* -- Object IDs currently in the dataset


   .. py:method:: _all_files()

      Private read-only iterator over all files that enforces a strict total order across
      objects and filters. Will not work prior to initialization of self.files and self.path in __init__.

      :Yields: *Path* -- The path to the file.


   .. py:method:: _filter_filename(object_id)

      Private read-only iterator over all files for a given object. This enforces a strict total order
      across filters. Will not work prior to self.files initialization in __init__

      :Yields: *filter_name, file name* -- The name of a filter and the file name for the fits file.
               The file name is relative to self.path


   .. py:method:: _object_files(object_id)

      Private read-only iterator over all files for a given object. This enforces a strict total order
      across filters. Will not work prior to initialization of self.files and self.path in __init__.

      :Yields: *Path* -- The path to the file.


   .. py:method:: _file_to_path(filename: str) -> pathlib.Path

      Turns a filename into a full path suitable for open. Equivalent to:

      ``Path(self.path) / Path(filename)``

      :param filename: The filename string
      :type filename: str

      :returns: A full path that is openable.
      :rtype: Path
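
      The equivalence stated above can be tried in isolation with ``pathlib``. In this sketch the
      standalone function ``file_to_path``, the ``root`` argument (standing in for ``self.path``),
      and the example paths are all hypothetical.

      .. code-block:: python

          from pathlib import Path


          def file_to_path(root, filename):
              """Join a catalog filename onto the dataset root directory."""
              return Path(root) / Path(filename)


          full_path = file_to_path("/data/cutouts", "1001_g.fits")
          print(full_path)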


   .. py:method:: _read_object_id(object_id: str)


   .. py:method:: _convert_to_torch(data: list[numpy.typing.ArrayLike])


   .. py:method:: _load_tensor_for_cache(object_id: str)

      Implementation of TensorCacheMixin abstract method.


   .. py:method:: _object_id_to_tensor(object_id: str)

      Converts an object_id to a PyTorch tensor with dimensions (self.num_filters, self.cutout_shape[0],
      self.cutout_shape[1]). This is done by reading the file and slicing away any excess pixels
      farthest from the (0,0) corner of the image.

      The current implementation reads the files once the first time they are accessed, and then
      keeps them in a dict for future accesses.

      :param object_id: The object_id requested
      :type object_id: str

      :returns: A tensor with dimension (self.num_filters, self.cutout_shape[0], self.cutout_shape[1])
      :rtype: torch.Tensor
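
      The cropping step described above, keeping the pixels nearest (0,0), can be sketched without
      any tensor library. This illustrates the slicing only and is not the method's actual
      implementation; it uses nested lists in place of tensors, and ``crop_from_origin`` is a
      hypothetical helper.

      .. code-block:: python

          def crop_from_origin(image, cutout_shape):
              """Keep the first cutout_shape[0] rows and cutout_shape[1] columns of a 2D image."""
              rows, cols = cutout_shape
              if len(image) < rows or any(len(row) < cols for row in image[:rows]):
                  raise ValueError("image is smaller than the requested cutout shape")
              return [row[:cols] for row in image[:rows]]


          # A 3 x 4 image whose pixel values encode their (row, column) position.
          image = [[r * 10 + c for c in range(4)] for r in range(3)]
          cropped = crop_from_origin(image, (2, 3))
          print(cropped)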