hyrax.datasets.data_cache
=========================

.. py:module:: hyrax.datasets.data_cache


Attributes
----------

.. autoapisummary::

   hyrax.datasets.data_cache.logger
   hyrax.datasets.data_cache.tensorboardx_logger


Classes
-------

.. autoapisummary::

   hyrax.datasets.data_cache.DataCache


Module Contents
---------------

.. py:data:: logger

.. py:data:: tensorboardx_logger

.. py:class:: DataCache(config, data_provider: hyrax.datasets.data_provider.DataProvider)

   DataCache tracks and manages a caching layer which can be used most effectively if the entirety of a
   training (or inference) epoch fits in system RAM.

   Two config values control this functionality:

   ``h.config["data_set"]["use_cache"]`` determines whether data dictionaries are served out of a cache.
   When set, the first epoch of training fills the cache with tensors, and subsequent epochs are served out
   of the cache.

   ``h.config["data_set"]["preload_cache"]`` starts a thread which iterates over the dataset/dataloader class
   to completion. The thread preloads the cache with tensors independently of the training process. The
   intent is for this thread to proceed faster than the first epoch of training, so that the first epoch
   is sped up as well.
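
   As a rough illustration of the two settings, the sketch below reads them from a plain nested
   dict. The dict is only a stand-in for ``h.config``, and the rule that preloading requires
   ``use_cache`` is an assumption made for the example, not something this page confirms.

   .. code-block:: python

      # A plain nested dict stands in for ``h.config`` (an assumption for
      # illustration); only the two key names come from this page.
      config = {"data_set": {"use_cache": True, "preload_cache": True}}

      use_cache = config["data_set"]["use_cache"]
      # Assumed rule for this sketch: preloading is only useful when caching is on.
      preload_cache = use_cache and config["data_set"]["preload_cache"]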

   This class caches the output of DataProvider before it is batched. Users can limit the amount of
   data cached by selecting only the fields they need in their ``data_request`` specification.

   The class logs to the tensorboard logger in the DataProvider (when configured).

   Initialize the DataCache with a Hyrax config.

   :param config: The Hyrax configuration that defines the data_request.
   :type config: dict
   :param data_provider: The DataProvider object which we are caching for.
   :type data_provider: DataProvider


   .. py:attribute:: _max_length


   .. py:attribute:: _resolve_data_func


   .. py:attribute:: _data_provider


   .. py:attribute:: _use_cache


   .. py:attribute:: _preload_cache


   .. py:attribute:: _data_size_bytes
      :value: 0


   .. py:attribute:: _insert_count
      :value: 0


   .. py:attribute:: logging_interval
      :value: 1000


   .. py:attribute:: _cache_map


   .. py:attribute:: _preload_thread
      :value: None


   .. py:method:: start_preload_thread()

      Start the cache preload thread, if configured.

      This method exists to separate initialization from thread start in DataProvider's
      constructor, so that the preload thread can always count on a fully initialized DataProvider.
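
      The separation of construction from thread start can be sketched as follows. ``ToyCache``
      and ``ToyProvider`` are hypothetical stand-ins written for this example; they are not the
      Hyrax classes and make no claim about how Hyrax implements this.

      .. code-block:: python

         import threading


         class ToyCache:
             """Hypothetical stand-in for a cache with an optional preload thread."""

             def __init__(self, provider, preload):
                 self._provider = provider
                 self._preload_cache = preload
                 self._preload_thread = None  # created later, never in __init__
                 self.loaded = []

             def start_preload_thread(self):
                 # Start the background thread only if preloading is enabled.
                 if self._preload_cache and self._preload_thread is None:
                     self._preload_thread = threading.Thread(target=self._preload, daemon=True)
                     self._preload_thread.start()

             def _preload(self):
                 # Reads provider attributes that only exist after its __init__ finishes.
                 for idx in range(self._provider.n_items):
                     self.loaded.append(self._provider.fetch(idx))


         class ToyProvider:
             """Hypothetical stand-in for the object that owns the cache."""

             def __init__(self, n_items):
                 self.cache = ToyCache(self, preload=True)  # built early in the constructor
                 self.n_items = n_items                     # set after the cache exists
                 self.cache.start_preload_thread()          # last step: provider is complete

             def fetch(self, idx):
                 return {"value": idx}


         provider = ToyProvider(4)
         provider.cache._preload_thread.join()  # wait for the background preload to finish

      Had the thread been started inside ``ToyCache.__init__``, ``_preload`` could run before
      ``n_items`` was assigned and fail with an ``AttributeError``.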


   .. py:method:: _idx_check(idx)


   .. py:method:: try_fetch(idx: int) -> dict | None

      Try to fetch a data_dict from the cache.

      :param idx: The DataProvider index of the data dict
      :type idx: int

      :returns: The data dict from the cache, or None on a cache miss.
      :rtype: Optional[dict]


   .. py:method:: insert_into_cache(idx: int, data: dict[str, dict[str, Any]])

      Insert a data dict into the cache.

      :param idx: Index of the data dict
      :type idx: int
      :param data: The data dict
      :type data: dict[str, dict[str, Any]]
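
      Together with ``try_fetch``, this method supports a fetch-or-compute pattern. The sketch
      below shows that pattern with a hypothetical ``ToyCache`` that mirrors only the two method
      names documented here; ``get_item`` and ``slow_resolve`` are illustrative helpers, not part
      of Hyrax.

      .. code-block:: python

         from typing import Any, Callable, Optional


         class ToyCache:
             """Hypothetical dict-backed cache mirroring the two documented method names."""

             def __init__(self) -> None:
                 self._cache_map: dict[int, dict[str, Any]] = {}

             def try_fetch(self, idx: int) -> Optional[dict]:
                 # Return the cached data dict, or None on a cache miss.
                 return self._cache_map.get(idx)

             def insert_into_cache(self, idx: int, data: dict[str, Any]) -> None:
                 self._cache_map[idx] = data


         def get_item(cache: ToyCache, idx: int, resolve: Callable[[int], dict]) -> dict:
             """Serve from the cache when possible; otherwise compute and store."""
             data = cache.try_fetch(idx)
             if data is None:
                 data = resolve(idx)
                 cache.insert_into_cache(idx, data)
             return data


         calls = []


         def slow_resolve(idx: int) -> dict:
             # Records each call so repeated work is visible.
             calls.append(idx)
             return {"images": {"pixel_sum": idx * 2}}


         cache = ToyCache()
         first = get_item(cache, 3, slow_resolve)   # miss: resolves and stores
         second = get_item(cache, 3, slow_resolve)  # hit: served from the cache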


   .. py:method:: _data_size(data, seen: set[int] | None = None) -> int
      :staticmethod:


   .. py:method:: _preload_tensor_cache()

      Preload all tensors in the dataset using multiple threads.


   .. py:method:: _lazy_map_executor(executor: concurrent.futures.Executor, idxs: collections.abc.Iterable[int])

      Lazy evaluation version of concurrent.futures.Executor.map().

      This limits memory usage during preloading by keeping only a small
      number of data dictionaries in memory at once.

      :param executor: An executor for running futures
      :type executor: concurrent.futures.Executor
      :param idxs: An iterable of DataProvider indexes
      :type idxs: Iterable[int]

      :Yields: *Iterator[torch.Tensor]* -- An iterator over torch tensors, lazily loaded
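
      One common way to build a lazy ``Executor.map()`` is to keep a bounded window of submitted
      futures and submit a new one each time a result is consumed. The sketch below illustrates
      that general technique. The function name ``lazy_map`` and the ``fn`` and ``max_in_flight``
      parameters are assumptions for the example and do not appear in the signature above; the
      actual implementation may differ.

      .. code-block:: python

         import concurrent.futures
         from collections import deque
         from collections.abc import Callable, Iterable, Iterator
         from itertools import islice
         from typing import Any


         def lazy_map(
             executor: concurrent.futures.Executor,
             fn: Callable[[int], Any],
             idxs: Iterable[int],
             max_in_flight: int = 4,
         ) -> Iterator[Any]:
             """Yield fn(idx) in input order with at most max_in_flight pending futures."""
             it = iter(idxs)
             # Prime the window with the first few tasks only.
             pending = deque(executor.submit(fn, i) for i in islice(it, max_in_flight))
             while pending:
                 result = pending.popleft().result()
                 # Top the window back up with one more task, if any remain.
                 for i in islice(it, 1):
                     pending.append(executor.submit(fn, i))
                 yield result


         with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
             squares = list(lazy_map(pool, lambda i: i * i, range(10), max_in_flight=3))

      Unlike ``Executor.map()``, which submits every task up front, this keeps the number of
      outstanding results bounded by ``max_in_flight``.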