hyrax.datasets.data_cache

Attributes

Classes

DataCache

DataCache tracks and manages a caching layer which can be used most effectively if the entirety of a training (or inference) epoch fits in system RAM.

Module Contents

logger
tensorboardx_logger
class DataCache(config, data_provider: hyrax.datasets.data_provider.DataProvider)

DataCache tracks and manages a caching layer which can be used most effectively if the entirety of a training (or inference) epoch fits in system RAM.

Two config options control this behavior:

h.config["data_set"]["use_cache"] determines whether data dictionaries are served out of a cache. When set, the first epoch of training fills the cache with tensors, and subsequent epochs are served from the cache.

h.config["data_set"]["preload_cache"] starts a thread that iterates over the dataset/dataloader class to completion, pre-loading the cache with tensors independently of the training process. The intent is for this thread to run ahead of the first epoch of training, so that the first epoch is sped up as well.
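As a rough illustration of the two options above, the sketch below sets them on a plain nested dict. The dict is a hypothetical stand-in for the Hyrax config object; only the "data_set", "use_cache", and "preload_cache" keys come from this page, and the default values shown are assumptions.

```python
# Hypothetical stand-in for the Hyrax config (h.config); the False defaults
# are assumed for illustration and are not taken from Hyrax itself.
config = {"data_set": {"use_cache": False, "preload_cache": False}}

# Serve data dictionaries out of the cache once the first epoch has filled it.
config["data_set"]["use_cache"] = True

# Additionally start the background thread that pre-loads the cache.
config["data_set"]["preload_cache"] = True
```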

This class caches the output of DataProvider before it is batched. Users can limit the amount of data cached by selecting only particular fields in their data_request specification.

The class logs to the tensorboard logger in the DataProvider (when configured).

Initialize the DataCache with a Hyrax config.

Parameters:
  • config (dict) – The Hyrax configuration that defines the data_request.

  • data_provider (DataProvider) – The DataProvider object which we are caching for.

_max_length
_resolve_data_func
_data_provider
_use_cache
_preload_cache
_data_size_bytes = 0
_insert_count = 0
logging_interval = 1000
_cache_map
_preload_thread = None
start_preload_thread()

Start the cache preload thread, if configured.

This method exists to separate initialization from thread start in DataProvider’s constructor, so that the preload thread can always rely on a fully initialized DataProvider.
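The separation of construction from thread start can be sketched as follows. ToyPreloader and ToyProvider are hypothetical stand-ins written for this example, not the Hyrax classes; the only idea taken from the text above is that the owner finishes its own setup before the worker thread is started.

```python
import threading
from typing import Optional


class ToyPreloader:
    """Hypothetical stand-in for a cache with an optional preload thread."""

    def __init__(self, preload: bool) -> None:
        # Construction only records state; no thread is started here.
        self._preload = preload
        self._thread: Optional[threading.Thread] = None
        self.loaded: list[int] = []

    def start_preload_thread(self) -> None:
        # Start the worker only if preloading is enabled and not yet started.
        if self._preload and self._thread is None:
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

    def _run(self) -> None:
        self.loaded.extend(range(3))  # placeholder for real preloading work

    def wait(self) -> None:
        # Block until the worker has finished, if one was started.
        if self._thread is not None:
            self._thread.join()


class ToyProvider:
    """Hypothetical owner that completes its setup before starting the thread."""

    def __init__(self) -> None:
        self.cache = ToyPreloader(preload=True)
        self.fields = ["image", "label"]  # remaining setup happens first
        self.cache.start_preload_thread()  # last step of construction


provider = ToyProvider()
provider.cache.wait()
```

Because the thread is started as the final step of the owner's constructor, the worker never observes a partially constructed owner.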

_idx_check(idx)
try_fetch(idx: int) → dict | None

Try to fetch a data_dict from the cache.

Parameters:

idx (int) – The DataProvider index of the data dict

Returns:

The data dict from the cache, or None on a cache miss.

Return type:

Optional[dict]

insert_into_cache(idx: int, data: dict[str, dict[str, Any]])

Insert a data dict into the cache.

Parameters:
  • idx (int) – Index of the data dict

  • data (dict[str, dict[str, Any]]) – The data dict
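A caller might combine try_fetch and insert_into_cache in a fetch-or-fill pattern like the one below. This is a minimal sketch: ToyCache is a hypothetical dict-backed stand-in that only mirrors the two method signatures documented here, and get_item and resolve are names invented for the example.

```python
from typing import Any, Callable, Optional


class ToyCache:
    """Hypothetical dict-backed stand-in, not the real DataCache."""

    def __init__(self) -> None:
        self._cache_map: dict[int, dict[str, dict[str, Any]]] = {}

    def try_fetch(self, idx: int) -> Optional[dict]:
        # Return the cached data dict, or None on a cache miss.
        return self._cache_map.get(idx)

    def insert_into_cache(self, idx: int, data: dict[str, dict[str, Any]]) -> None:
        self._cache_map[idx] = data


def get_item(cache: ToyCache, idx: int, resolve: Callable[[int], dict]) -> dict:
    """Serve from the cache when possible; otherwise resolve and store."""
    data = cache.try_fetch(idx)
    if data is None:
        data = resolve(idx)
        cache.insert_into_cache(idx, data)
    return data


calls: list[int] = []


def resolve(idx: int) -> dict:
    # Record each expensive load so repeated access can be checked.
    calls.append(idx)
    return {"dataset": {"value": idx * 10}}


first = get_item(ToyCache(), 0, resolve)   # miss on a fresh cache
shared = ToyCache()
a = get_item(shared, 1, resolve)           # miss: resolve is called
b = get_item(shared, 1, resolve)           # hit: served from the cache
```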

static _data_size(data, seen: set[int] | None = None) → int
_preload_tensor_cache()

Preload all tensors in the dataset using multiple threads.

_lazy_map_executor(executor: concurrent.futures.Executor, idxs: collections.abc.Iterable[int])

Lazy evaluation version of concurrent.futures.Executor.map().

This limits memory usage during preloading by keeping only a small number of data dictionaries in memory at once.

Parameters:
  • executor (concurrent.futures.Executor) – An executor for running futures

  • idxs (Iterable[int]) – An iterable of DataProvider indexes

Yields:

Iterator[torch.Tensor] – An iterator over torch tensors, lazily loaded
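One way to build a lazy alternative to Executor.map(), which submits every item up front, is to cap the number of in-flight futures. The sketch below illustrates that general technique and is not the Hyrax implementation: lazy_map, its fn argument, and the max_in_flight parameter are assumptions introduced for the example.

```python
import concurrent.futures
from collections import deque
from collections.abc import Callable, Iterable, Iterator
from typing import TypeVar

T = TypeVar("T")


def lazy_map(
    executor: concurrent.futures.Executor,
    fn: Callable[[int], T],
    idxs: Iterable[int],
    max_in_flight: int = 4,
) -> Iterator[T]:
    """Yield fn(idx) in submission order, keeping few results in memory."""
    pending: deque = deque()
    for idx in idxs:
        pending.append(executor.submit(fn, idx))
        # Once the window is full, wait for the oldest future before
        # submitting more work, so at most max_in_flight results exist.
        if len(pending) >= max_in_flight:
            yield pending.popleft().result()
    # Drain whatever is still outstanding after the input is exhausted.
    while pending:
        yield pending.popleft().result()


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    squares = list(lazy_map(pool, lambda i: i * i, range(6), max_in_flight=2))
```

Results come back in the order the indexes were submitted, and new work is only submitted as the consumer pulls items from the generator.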