hyrax.data_sets.inference_dataset
=================================

.. py:module:: hyrax.data_sets.inference_dataset


Attributes
----------

.. autoapisummary::

   hyrax.data_sets.inference_dataset.logger
   hyrax.data_sets.inference_dataset.ORIGINAL_DATASET_CONFIG_FILENAME


Classes
-------

.. autoapisummary::

   hyrax.data_sets.inference_dataset.InferenceDataSet
   hyrax.data_sets.inference_dataset.InferenceDataSetWriter


Module Contents
---------------

.. py:data:: logger

.. py:data:: ORIGINAL_DATASET_CONFIG_FILENAME
   :value: 'original_dataset_config.toml'


.. py:class:: InferenceDataSet(config, results_dir: Union[pathlib.Path, str] | None = None, verb: str | None = None)

   Bases: :py:obj:`hyrax.data_sets.data_set_registry.HyraxDataset`, :py:obj:`torch.utils.data.Dataset`


   A dataset class that exposes the output of inference as a dataset,
   e.g. when performing umap/visualization operations.

   Initialize an InferenceDataSet object.

   As a user of this code, you should almost never create this class directly. Instances of this class
   are returned by the umap and infer verbs. Prefer those over creating your own.

   If you do end up creating your own instance, you will need a hyrax config, and you will need to know
   where the result you are interested in is stored.

   :param config: The hyrax config dictionary
   :type config: dict
   :param results_dir: The results subdirectory of the inference or umap results you want to access, by default None.
                       If no results subdirectory is provided, this function will attempt the following in order:

                       #. Use the directory specified in ``config['results']['inference_dir']`` if set and the directory
                          exists
                       #. Look in the results directory configured in ``config['general']['results_dir']``
                          (``./results/`` by default), then use the most recent results directory corresponding
                          to the verb specified.
   :type results_dir: Optional[Union[Path, str]], optional
   :param verb: The name of the verb that generated the results, only important when the most recent results
                are being fetched. If no verb is provided, "infer" will be assumed.
   :type verb: Optional[str], optional

   :raises RuntimeError: When the provided results directory is corrupt, or cannot be found.
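
   The lookup order described above can be sketched as follows. This is a minimal, hypothetical
   illustration and not the hyrax implementation: the helper name ``resolve_results_dir`` and the
   assumption that run directories contain the verb name and sort chronologically by name are
   introduced here for the example only.

   .. code-block:: python

      import tempfile
      from pathlib import Path


      def resolve_results_dir(config, results_dir=None, verb=None):
          """Hypothetical helper mirroring the documented lookup order."""
          # An explicitly provided directory is used as-is, but it must exist.
          if results_dir is not None:
              path = Path(results_dir)
              if not path.is_dir():
                  raise RuntimeError(f"Results directory {path} not found")
              return path

          # Step 1: config['results']['inference_dir'], if set and present on disk.
          configured = config.get("results", {}).get("inference_dir")
          if configured and Path(configured).is_dir():
              return Path(configured)

          # Step 2: newest directory for the verb under config['general']['results_dir'].
          verb = verb or "infer"  # "infer" is assumed when no verb is given
          root = Path(config.get("general", {}).get("results_dir", "./results/"))
          # ASSUMPTION: directory names embed the verb and sort chronologically.
          candidates = sorted(p for p in root.glob(f"*-{verb}-*") if p.is_dir())
          if not candidates:
              raise RuntimeError(f"No {verb} results found in {root}")
          return candidates[-1]


      with tempfile.TemporaryDirectory() as tmp:
          # Made-up directory names, for illustration only.
          for name in (
              "20240101-000000-infer-aaaa",
              "20240102-000000-infer-bbbb",
              "20240103-000000-umap-cccc",
          ):
              (Path(tmp) / name).mkdir()
          config = {"general": {"results_dir": tmp}}
          latest_infer = resolve_results_dir(config).name
          latest_umap = resolve_results_dir(config, verb="umap").name

   With no ``verb`` given, the sketch selects the newest ``infer`` directory; passing ``verb="umap"``
   selects the newest ``umap`` directory instead.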


   .. py:attribute:: results_dir


   .. py:attribute:: batch_index


   .. py:attribute:: length


   .. py:attribute:: cached_batch_num
      :type:  int | None
      :value: None


   .. py:attribute:: shape_element


   .. py:attribute:: _original_dataset_config


   .. py:attribute:: original_dataset


   .. py:method:: _shape()

      The shape of the dataset (Discovered from files)

      :returns: Tuple with the shape of an individual element of the dataset
      :rtype: Tuple


   .. py:method:: ids() -> collections.abc.Generator[str]

      IDs of this dataset, returned as a generator of strings.

      These IDs are the IDs of the dataset used originally to generate this dataset.

      :returns: Generator that yields the string ids of this dataset
      :rtype: Generator[str]

      :Yields: *Generator[str]* -- Yields the string ids of this dataset


   .. py:method:: __getitem__(idx: Union[int, numpy.ndarray])

      Implements the ``[]`` operator

      :param idx: Either an index or a numpy array of indexes.
                  These are NOT the ID values of the dataset, but rather a zero-based index starting
                  at the beginning of the inference dataset.
      :type idx: Union[int, np.ndarray]

      :returns: Either the tensor corresponding to a single result, or a tensor containing multiple
                results if multiple indexes were passed.
      :rtype: torch.tensor
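
      The difference between positional indexes and dataset IDs can be illustrated with a toy
      stand-in. ``ToyResults`` is hypothetical and holds everything in memory as numpy arrays; the
      real class reads batch files and returns torch tensors.

      .. code-block:: python

         import numpy as np


         class ToyResults:
             """Hypothetical in-memory stand-in, not the hyrax implementation."""

             def __init__(self, ids, vectors):
                 self._ids = list(ids)
                 self._vectors = np.asarray(vectors)

             def ids(self):
                 # IDs come from the original dataset and are unrelated to position.
                 yield from self._ids

             def __len__(self):
                 return len(self._ids)

             def __getitem__(self, idx):
                 # idx is a zero-based position (int) or an array of positions.
                 return self._vectors[idx]


         toy = ToyResults(
             ids=["obj-907", "obj-113", "obj-542"],
             vectors=np.arange(12.0).reshape(3, 4),
         )
         single = toy[1]                    # one result
         several = toy[np.array([2, 0])]    # two results, in the order requested
         id_of_single = list(toy.ids())[1]  # the ID stored at position 1

      Indexing with ``1`` selects the second stored result regardless of what its ID string is.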


   .. py:method:: __len__() -> int

      Returns the length of the dataset.

      :returns: Length of the dataset.
      :rtype: int


   .. py:property:: original_config
      :type: dict


      Get the original configuration for the dataset used to generate this inference dataset

      Since this sort of dataset is by definition an intermediate product, this returns the
      runtime config used to construct the original dataset rather than this one.

      :returns: Configuration that can be used to create the original dataset that was used
                as input for whatever inference process created this dataset.
      :rtype: dict


   .. py:method:: metadata_fields() -> list[str]

      Get the metadata fields associated with the original dataset used to generate this one

      :returns: List of valid field names for metadata queries
      :rtype: list[str]


   .. py:method:: metadata(idxs: numpy.typing.ArrayLike, fields: list[str]) -> numpy.typing.ArrayLike

      Get metadata associated with the data in the InferenceDataSet. This metadata comes from
      the original dataset, but is indexed according to the InferenceDataSet.

      :param idxs: Indexes in the InferenceDataSet for which metadata is desired
      :type idxs: npt.ArrayLike
      :param fields: Metadata fields requested
      :type fields: list[str]

      :returns: An array where the rows correspond to the passed list of indexes and the columns
                correspond to the fields passed. Order is preserved: ``metadata[i]`` corresponds to ``idxs[i]``.
      :rtype: npt.ArrayLike
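
      The order-preserving behavior can be sketched with a numpy structured array. The table
      contents, the field names, and the ``toy_metadata`` helper are all hypothetical, and the
      assumption that the result is a structured array is made for this example only.

      .. code-block:: python

         import numpy as np

         # Hypothetical metadata table from the original dataset, one row per item.
         table = np.array(
             [(10.5, -3.2, 21.1), (11.0, -3.0, 19.4), (12.25, -2.5, 22.8)],
             dtype=[("ra", "f8"), ("dec", "f8"), ("mag", "f8")],
         )


         def toy_metadata(idxs, fields):
             # Select rows in the order requested, then keep only the requested fields.
             return table[np.asarray(idxs)][fields]


         result = toy_metadata([2, 0], ["ra", "mag"])
         first_ra = result["ra"][0]  # belongs to index 2, the first index requested

      Row ``i`` of ``result`` describes ``idxs[i]``, so requesting ``[2, 0]`` returns the row for
      index 2 first.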


   .. py:method:: _load_from_batch_file(batch_num: int, ids=Union[int, np.ndarray]) -> numpy.ndarray

      Returns an array of tensors for the given IDs within the batch identified by
      ``batch_num``.


   .. py:method:: _resolve_results_dir(config, results_dir: Union[pathlib.Path, str] | None, verb: str | None) -> pathlib.Path

      Initialize an inference results directory as a data source. Accepts an override of what
      directory to use


.. py:class:: InferenceDataSetWriter(original_dataset: torch.utils.data.Dataset, result_dir: Union[str, pathlib.Path])

   Class to write out inference datasets. Used by the infer and umap verbs to consistently write out
   numpy files in batches which can be read by InferenceDataSet.

   With the exception of building ID->Batch indexing info, this is implemented as a bag-o-functions that
   manipulate the filesystem directly as their primary effect.
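
   The batch-plus-index pattern described above can be sketched as follows. ``ToyBatchWriter`` is a
   hypothetical, simplified illustration: the file names (``batch_<n>.npy``, ``batch_index.npy``),
   the index layout, and the integer ID dtype are assumptions made for this example, and the sketch
   writes synchronously rather than through a writer pool.

   .. code-block:: python

      import tempfile
      from pathlib import Path

      import numpy as np


      class ToyBatchWriter:
          """Hypothetical writer; file names and index layout are assumptions."""

          def __init__(self, result_dir):
              self.result_dir = Path(result_dir)
              self.result_dir.mkdir(parents=True, exist_ok=True)
              self.batch_index = 0
              self.all_ids = []
              self.all_batch_nums = []

          def write_batch(self, ids, tensors):
              # The caller guarantees one ID per tensor.
              assert len(ids) == len(tensors)
              path = self.result_dir / f"batch_{self.batch_index}.npy"
              np.save(path, np.stack(tensors))
              # Remember which batch each ID landed in.
              self.all_ids.extend(ids.tolist())
              self.all_batch_nums.extend([self.batch_index] * len(ids))
              self.batch_index += 1

          def write_index(self):
              # Persist the ID -> batch mapping built up across write_batch calls.
              index = np.array(
                  list(zip(self.all_ids, self.all_batch_nums)),
                  dtype=[("id", "i8"), ("batch_num", "i8")],
              )
              np.save(self.result_dir / "batch_index.npy", index)


      with tempfile.TemporaryDirectory() as tmp:
          writer = ToyBatchWriter(tmp)
          writer.write_batch(np.array([101, 102]), [np.zeros(3), np.ones(3)])
          writer.write_batch(np.array([103]), [np.full(3, 2.0)])
          writer.write_index()

          # Read back: find the batch that holds ID 103, then the row inside it.
          index = np.load(Path(tmp) / "batch_index.npy")
          batch_num = int(index["batch_num"][index["id"] == 103][0])
          ids_in_batch = index["id"][index["batch_num"] == batch_num]
          row = int(np.where(ids_in_batch == 103)[0][0])
          restored = np.load(Path(tmp) / f"batch_{batch_num}.npy")[row]

   The read-back step shows why the index is written alongside the batches: a reader needs it to
   locate the batch file and row for a given ID.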

   .. py:method:: __init__


   .. py:attribute:: result_dir


   .. py:attribute:: batch_index
      :value: 0


   .. py:attribute:: id_dtype
      :value: None


   .. py:attribute:: all_ids


   .. py:attribute:: all_batch_nums


   .. py:attribute:: writer_pool


   .. py:attribute:: original_dataset_config


   .. py:method:: write_batch(ids: numpy.ndarray, tensors: list[numpy.ndarray])

      Write a batch of tensors into the dataset. This writes the whole batch immediately.
      The caller is responsible for keeping batch sizes consistent and for ensuring that ``ids``
      has the same length as ``tensors``.

      :param ids: Array of IDs. The dtype of the elements must match the dtype of the IDs of the original dataset
                  used to construct this InferenceDataSetWriter.
      :type ids: np.ndarray
      :param tensors: List of consistently dimensioned numpy arrays to save.
      :type tensors: list[np.ndarray]


   .. py:method:: write_index()

      Writes out the batch index built up by this object over multiple ``write_batch`` calls.
      See ``_save_batch_index`` for details.


   .. py:method:: _save_batch_index()

      Save a batch index in the result directory provided


   .. py:method:: _save_file(filename: str, data: numpy.ndarray)

      Save a numpy array to a file in the result directory provided