hyrax.datasets.data_cache
Attributes
Classes
DataCache: tracks and manages a caching layer which can be used most effectively if the entirety of a training (or inference) epoch fits in system RAM.
Module Contents
- class DataCache(config, data_provider: hyrax.datasets.data_provider.DataProvider)
DataCache tracks and manages a caching layer that is most effective when an entire training (or inference) epoch fits in system RAM.
Two configuration options control this functionality:
h.config["data_set"]["use_cache"] determines whether data dictionaries are served out of a cache. When set, the first epoch of training fills the cache with tensors, and subsequent epochs are served from the cache.
h.config["data_set"]["preload_cache"] starts a thread that iterates over the dataset/dataloader class to completion. The thread preloads the cache with tensors independently of the training process. The intent is for this thread to run ahead of the first epoch of training, so that the first epoch is sped up as well.
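As a minimal sketch of the two options above, the following uses a plain nested dict as a stand-in for the Hyrax config. Only the key names "data_set", "use_cache" and "preload_cache" come from the description above; the dict itself and the variable names are assumptions made for illustration.

```python
# Hypothetical stand-in for a Hyrax config object; a real config may be
# constructed differently. Only the key names are taken from the docs above.
config = {
    "data_set": {
        "use_cache": True,      # serve data dicts from the cache after the first epoch
        "preload_cache": True,  # fill the cache from a background thread
    }
}

# Read the two flags the same way the documentation indexes them.
use_cache = config["data_set"]["use_cache"]
preload_cache = config["data_set"]["preload_cache"]
```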
This class caches the output of DataProvider before it is batched. Users can limit the amount of cached data by selecting only the fields they need in their data_request specification.
The class logs to the tensorboard logger in the DataProvider (when configured).
Initialize the DataCache with a Hyrax config.
- Parameters:
config (dict) – The Hyrax configuration that defines the data_request.
data_provider (DataProvider) – The DataProvider object which we are caching for.
- start_preload_thread()
Start the cache preload thread, if configured.
This method exists to separate initialization from thread start in DataProvider's constructor, so that the preload thread can always rely on a fully initialized DataProvider.
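The reason for a separate start method can be illustrated with a small sketch. Everything here (the PreloadingProvider class, its load method, and the join helper) is hypothetical and is not the Hyrax implementation; it only shows the general pattern of finishing construction before any background thread touches the object.

```python
import threading


class PreloadingProvider:
    """Hypothetical provider; not the Hyrax DataProvider."""

    def __init__(self, n_items, preload):
        self.cache = {}
        self.preload = preload
        self._thread = None
        # No thread is started here, so nothing can observe a
        # partially constructed object.
        self.n_items = n_items

    def load(self, idx):
        # Placeholder for an expensive read.
        return {"value": idx * 2}

    def _preload_all(self):
        # Runs in the background thread and fills the cache.
        for idx in range(self.n_items):
            self.cache.setdefault(idx, self.load(idx))

    def start_preload_thread(self):
        # Called only after __init__ has returned.
        if not self.preload:
            return
        self._thread = threading.Thread(target=self._preload_all, daemon=True)
        self._thread.start()

    def join(self):
        # Wait for the preload thread, if one was started.
        if self._thread is not None:
            self._thread.join()


provider = PreloadingProvider(n_items=4, preload=True)
provider.start_preload_thread()
provider.join()
```

If the thread were started inside `__init__` before `n_items` was assigned, `_preload_all` could run against an incomplete object; deferring the start avoids that ordering problem.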
- try_fetch(idx: int) -> dict | None
Try to fetch a data_dict from the cache.
- Parameters:
idx (int) – The DataProvider index of the data dict
- Returns:
The data dict from the cache, or None on a cache miss.
- Return type:
Optional[dict]
- insert_into_cache(idx: int, data: dict[str, dict[str, Any]])
Insert a data dict into the cache.
- Parameters:
idx (int) – Index of the data dict
data (dict[str, dict[str, Any]]) – The data dict
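The two methods above suggest a read-through pattern: try the cache first, and on a miss load the item and insert it. The sketch below assumes a simple dict-backed store. TinyCache, load_from_source and get_item are hypothetical names used for illustration, and the real DataCache may store and evict data differently.

```python
from typing import Any, Optional


class TinyCache:
    """Hypothetical dict-backed cache mirroring the two method names above."""

    def __init__(self):
        self._store: dict[int, dict[str, dict[str, Any]]] = {}

    def try_fetch(self, idx: int) -> Optional[dict]:
        # dict.get returns None when idx has not been cached yet.
        return self._store.get(idx)

    def insert_into_cache(self, idx: int, data: dict[str, dict[str, Any]]) -> None:
        self._store[idx] = data


def load_from_source(idx: int) -> dict[str, dict[str, Any]]:
    # Placeholder for the expensive, uncached load.
    return {"survey": {"flux": float(idx)}}


def get_item(cache: TinyCache, idx: int) -> dict:
    # Read-through: serve from the cache, filling it on a miss.
    data = cache.try_fetch(idx)
    if data is None:
        data = load_from_source(idx)
        cache.insert_into_cache(idx, data)
    return data


cache = TinyCache()
miss = cache.try_fetch(3)     # nothing cached yet
first = get_item(cache, 3)    # loads and inserts
second = get_item(cache, 3)   # served from the cache
```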
- _lazy_map_executor(executor: concurrent.futures.Executor, idxs: collections.abc.Iterable[int])
Lazy evaluation version of concurrent.futures.Executor.map().
This limits memory usage during preloading by keeping only a small number of data dictionaries in memory at once.
- Parameters:
executor (concurrent.futures.Executor) – An executor for running futures
idxs (Iterable[int]) – An iterable list of DataProvider indexes
- Yields:
Iterator[torch.Tensor] – An iterator over torch tensors, lazily loaded
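One way to get lazy behaviour from an executor is to keep only a bounded number of futures in flight and to yield results in submission order. The sketch below illustrates that general idea. It is not the Hyrax implementation: the function name lazy_map, the fn argument, and the max_pending parameter are assumptions introduced here.

```python
from collections import deque
from collections.abc import Callable, Iterable, Iterator
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Any


def lazy_map(
    executor: Executor,
    fn: Callable[[int], Any],
    idxs: Iterable[int],
    max_pending: int = 4,
) -> Iterator[Any]:
    """Yield fn(idx) in order while holding at most max_pending futures."""
    pending = deque()
    for idx in idxs:
        pending.append(executor.submit(fn, idx))
        # Once the window is full, block on the oldest future before
        # submitting more work, so results never pile up in memory.
        if len(pending) >= max_pending:
            yield pending.popleft().result()
    # Drain whatever is still in flight.
    while pending:
        yield pending.popleft().result()


with ThreadPoolExecutor(max_workers=2) as pool:
    squares = list(lazy_map(pool, lambda i: i * i, range(6), max_pending=3))
```

By contrast, Executor.map() submits every item up front, so the number of in-flight and completed results is not bounded in the same way.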