hyrax.data_sets.downloaded_lsst_dataset#
Attributes#
Classes#
DownloadedLSSTDataset: A dataset that inherits from LSSTDataset and downloads cutouts from the LSST butler.
Module Contents#
- class DownloadedLSSTDataset(config, data_location)[source]#
Bases:
hyrax.data_sets.lsst_dataset.LSSTDataset, hyrax.data_sets.tensor_cache_mixin.TensorCacheMixin
DownloadedLSSTDataset: A dataset that inherits from LSSTDataset and downloads cutouts from the LSST butler, saving them as .pt files on first access. On subsequent accesses, it loads cutouts directly from these cached files.
This class also creates a manifest file that records the shape of each cutout and the corresponding filename.
- Public Methods:
- download_cutouts(indices=None, sync_filesystem=True, max_workers=None, force_retry=False):
Download cutouts with parallel processing. Automatically resumes from previous progress. Use max_workers to control thread count, force_retry to re-attempt failed downloads.
- manifest_stats():
Returns dict with download statistics: total, successful, failed, pending counts and manifest file path.
- download_progress():
Returns detailed progress metrics including completion percentage and failure rates.
- reset_failed_downloads():
Resets all failed download attempts to allow retry without force_retry flag. Returns count of reset entries.
- save_manifest_now():
Forces immediate manifest save (normally saved periodically during downloads).
- cache_info():
Returns LRU cache statistics for patch fetching performance monitoring.
- clear_cache():
Clears the patch LRU cache to free memory.
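The statistics described for manifest_stats() and download_progress() can be pictured with a small sketch. It is not the class's implementation: the function name summarize_manifest, the dictionary keys, and the use of an empty filename to mean "pending" are assumptions made for illustration. Only the "Attempted" marker for failed downloads comes from this page.

```python
def summarize_manifest(rows):
    """Hypothetical summary of manifest rows; keys and pending marker are assumed."""
    total = len(rows)
    # "Attempted" marks a download where every band failed.
    failed = sum(1 for row in rows if row["filename"] == "Attempted")
    # Assumption: an empty filename means the download has not been tried yet.
    pending = sum(1 for row in rows if not row["filename"])
    successful = total - failed - pending
    completion = 100.0 * successful / total if total else 0.0
    return {
        "total": total,
        "successful": successful,
        "failed": failed,
        "pending": pending,
        "completion_percent": completion,
    }


example_rows = [
    {"filename": "cutout_0000.pt"},
    {"filename": "cutout_0001.pt"},
    {"filename": "Attempted"},
    {"filename": ""},
]
print(summarize_manifest(example_rows))
```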
- Usage Example:
import hyrax

# Initialize Hyrax
h = hyrax.Hyrax()
a = h.prepare()

# Download all cutouts (resumes automatically)
a.download_cutouts(max_workers=4)
# WARNING: The LRU caching scheme is slightly complicated, so it is
# recommended to use the default max_workers=1 for the first download.
# Simply using more workers may not always speed up the download process.

# Check progress
a.download_progress()

# Retry failed downloads
a.download_cutouts(force_retry=True)

# Access cutouts (loads from cache)
cutout = a[0]  # Single cutout
cutouts = a[0:10]  # Multiple cutouts
File Organization:
- Cutouts saved as: cutout_{object_id}.pt or cutout_{index:04d}.pt
- Manifest saved as: manifest.fits (Astropy) or manifest.parquet (HATS)
- All files stored in config["general"]["data_dir"]
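As a rough illustration of the two cutout naming patterns, the sketch below formats a file name from either an object ID or a zero-padded index. The helper name cutout_filename and the example directory are hypothetical; only the two patterns themselves are taken from this page.

```python
from pathlib import Path


def cutout_filename(index, object_id=None):
    """Hypothetical helper: build a cutout file name from an object ID or an index."""
    if object_id is not None:
        return f"cutout_{object_id}.pt"
    # {index:04d} pads the index with zeros to at least four digits.
    return f"cutout_{index:04d}.pt"


# Stand-in for config["general"]["data_dir"]; the real value comes from the config.
data_dir = Path("example_data_dir")
print(data_dir / cutout_filename(7))
print(data_dir / cutout_filename(7, object_id=123456789))
```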
Initialize the dataset with either a HATS catalog or an Astropy table.
Config can specify either:
- config["data_set"]["hats_catalog"]: path to HATS catalog
- config["data_set"]["astropy_table"]: path to any file readable by Astropy Table
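A minimal sketch of how the two config keys relate to the manifest format. The helper describe_catalog_source is hypothetical, the config is shown as a plain nested dict, and the choice to check hats_catalog first when both keys are set is an assumption, not documented behavior.

```python
def describe_catalog_source(config):
    """Hypothetical helper: report the catalog setting in use and the matching manifest name."""
    data_set = config.get("data_set", {})
    # Assumption: the HATS setting is checked first if both keys are present.
    if data_set.get("hats_catalog"):
        return {"kind": "hats", "path": data_set["hats_catalog"], "manifest": "manifest.parquet"}
    if data_set.get("astropy_table"):
        return {"kind": "astropy", "path": data_set["astropy_table"], "manifest": "manifest.fits"}
    raise ValueError("config must set data_set.hats_catalog or data_set.astropy_table")


# Illustrative values only; real paths depend on your setup.
config = {
    "general": {"data_dir": "example_data_dir"},
    "data_set": {"astropy_table": "catalog.fits"},
}
source = describe_catalog_source(config)
print(source["kind"], source["manifest"])
```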
- ids(log_every=None)[source]#
Generator yielding object IDs for the entire dataset. Required by TensorCacheMixin.
- _initialize_manifest()[source]#
Create new manifest or load/merge with existing manifest, with band filtering validation.
The manifest is always an astropy Table with at least the following columns:
- cutout_shape: np.array of dimensions, e.g. [3,150,150]
- filename: string containing the name of the file that holds the tensor for the object
- downloaded_bands: string containing a comma-separated list of the bands downloaded. Order is expected to be consistent between rows.
When this astropy table is loaded into memory, multiple sources are consulted:
- The manifest on the filesystem, which is the source of truth for what files have been downloaded. If this is not found, it is created.
- The bands given in the catalog passed in.
- _update_manifest_from_catalog(existing_manifest)[source]#
Using object_id as a unique key, adds manifest entries to existing_manifest, using self.catalog as the source of any new objects.
self.catalog is not altered by this operation.
Entries in existing_manifest are not altered by this operation. New entries are added to the end of existing_manifest with a state indicating they have not been downloaded.
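The merge rule can be sketched with plain lists of dicts standing in for the astropy tables. This is an illustration under assumptions: the function name is hypothetical, the sketch returns a new list rather than extending a table, and an empty filename is used here as the "not downloaded" state.

```python
def merge_catalog_into_manifest(existing_manifest, catalog):
    """Hypothetical merge keyed on object_id; inputs are lists of dicts, not astropy Tables."""
    known_ids = {row["object_id"] for row in existing_manifest}
    merged = list(existing_manifest)  # existing rows are kept unchanged, in order
    for row in catalog:
        if row["object_id"] not in known_ids:
            # Assumption: empty values mark an entry that has not been downloaded.
            merged.append({
                "object_id": row["object_id"],
                "cutout_shape": None,
                "filename": "",
                "downloaded_bands": "",
            })
            known_ids.add(row["object_id"])
    return merged


manifest = [{"object_id": 1, "cutout_shape": (3, 150, 150),
             "filename": "cutout_1.pt", "downloaded_bands": "g,r,i"}]
catalog = [{"object_id": 2}, {"object_id": 1}, {"object_id": 3}]
merged = merge_catalog_into_manifest(manifest, catalog)
print([row["object_id"] for row in merged])
```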
- _build_catalog_to_manifest_index_map()[source]#
Build efficient mapping from catalog indices to manifest indices.
- _add_manifest_columns_to_table(table)[source]#
Add cutout_shape, filename, and downloaded_bands columns to manifest.
- _get_available_bands_from_manifest(manifest)[source]#
Best effort to get available bands by looking at first 10 successful downloads for consistency.
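One way to read "best effort" is sketched below: take the band lists of the first few successful rows and report whether they agree. The function name, the return shape, and the decision to report inconsistency rather than raise are all assumptions made for this example.

```python
def guess_available_bands(manifest, sample_size=10):
    """Hypothetical best-effort band lookup over the first successful downloads."""
    samples = []
    for row in manifest:
        filename = row.get("filename")
        bands = row.get("downloaded_bands")
        # Skip rows that are pending (empty) or failed ("Attempted").
        if filename and filename != "Attempted" and bands:
            samples.append(bands.split(","))
            if len(samples) == sample_size:
                break
    if not samples:
        return None, True
    consistent = all(sample == samples[0] for sample in samples)
    return samples[0], consistent


rows = [
    {"filename": "Attempted", "downloaded_bands": ""},
    {"filename": "cutout_2.pt", "downloaded_bands": "g,r,i"},
    {"filename": "cutout_3.pt", "downloaded_bands": "g,r,i"},
]
print(guess_available_bands(rows))
```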
- _setup_band_filtering(requested_bands, original_band_order)[source]#
Setup band filtering to extract only requested bands from cached cutouts.
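Band filtering amounts to mapping each requested band to its position in the cached band order and then selecting those channels. The sketch below uses nested lists in place of tensors; the helper names and the error raised for an unknown band are assumptions.

```python
def band_indices(requested_bands, original_band_order):
    """Hypothetical helper: positions of the requested bands in the cached band order."""
    missing = [band for band in requested_bands if band not in original_band_order]
    if missing:
        raise ValueError(f"bands not present in cached cutouts: {missing}")
    return [original_band_order.index(band) for band in requested_bands]


def filter_bands(cutout, indices):
    """Select channels along the first axis (a tensor would be indexed the same way)."""
    return [cutout[i] for i in indices]


# A tiny 3-band "cutout" with one pixel per band, in g, r, i order.
cutout = [[[1.0]], [[2.0]], [[3.0]]]
indices = band_indices(["i", "g"], ["g", "r", "i"])
print(indices, filter_bands(cutout, indices))
```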
- _get_cutout_path_from_idx(idx)[source]#
Generate cutout file path for a given index.
This simply applies a pattern to the filename using the object_id column. No guarantees are made about the file itself.
- _get_cutout_path_from_manifest(idx)[source]#
Get the cutout path by consulting the manifest.
The download thread ensures that the filename is not written to the manifest until all the bands that we intend to download are downloaded.
This function is intended to be a thread-safe way to get valid cutout paths. If the file exists and is believed to be correctly downloaded, its path is returned; if there is any other issue, the function returns None.
- Parameters:
idx (int) – The catalog index of the relevant cutout
- Returns:
Path to the cutout, or None if no valid cutout is available for this index.
- Return type:
Path or None
- _update_manifest_entry(idx, cutout_shape=None, filename='Attempted', downloaded_bands=None)[source]#
Thread-safe manifest update with periodic saves.
- Parameters:
idx – Index in the manifest
cutout_shape – Shape tuple of the cutout tensor, or None for failed downloads
filename – Basename of the saved file, or “Attempted” only when ALL bands fail
downloaded_bands – List of band names successfully downloaded in tensor order
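A thread-safe update with periodic saves is commonly built from a lock plus an update counter. The sketch below shows that pattern only; the class ManifestSketch, the save_every threshold, and the in-memory row layout are assumptions, and the save step is a counter rather than a real file write.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ManifestSketch:
    """Hypothetical in-memory manifest guarded by a lock."""

    def __init__(self, n_rows, save_every=100):
        self._lock = threading.Lock()
        self.rows = [{"cutout_shape": None, "filename": "", "downloaded_bands": ""}
                     for _ in range(n_rows)]
        self.save_every = save_every
        self._updates_since_save = 0
        self.save_count = 0

    def _save_locked(self):
        # Placeholder for writing the manifest to disk; caller holds the lock.
        self.save_count += 1
        self._updates_since_save = 0

    def update_entry(self, idx, cutout_shape=None, filename="Attempted", downloaded_bands=None):
        with self._lock:
            self.rows[idx] = {
                "cutout_shape": cutout_shape,
                "filename": filename,
                "downloaded_bands": ",".join(downloaded_bands or []),
            }
            self._updates_since_save += 1
            if self._updates_since_save >= self.save_every:
                self._save_locked()


manifest = ManifestSketch(n_rows=10, save_every=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    for i in range(10):
        pool.submit(manifest.update_entry, i, (3, 150, 150), f"cutout_{i:04d}.pt", ["g", "r", "i"])
print(manifest.save_count)
```

Holding the lock across both the row update and the save check keeps the counter consistent no matter how the worker threads interleave.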
- _sync_manifest_with_filesystem()[source]#
Sync manifest with actual downloaded files on disk.
This updates the manifest to reflect what is on the filesystem. For existing cutouts, this loads every file using torch.load.
- static _request_patch_cached(tract_index, patch_index, butler, skymap_name, bands_tuple)[source]#
Cached patch fetching using static method.
Static method means no ‘self’ in cache key, making it truly global. Thread-safe because each call creates its own Butler instance.
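The point about the cache key can be illustrated with functools.lru_cache on a static method. This is a generic sketch, not the class's code: PatchFetcherSketch and its return value are made up, and the butler argument is left out because the sketch does not talk to a real Butler. It also shows why the bands are passed as a tuple: lru_cache requires hashable arguments.

```python
from functools import lru_cache


class PatchFetcherSketch:
    """Hypothetical class showing a cache shared by all instances."""

    @staticmethod
    @lru_cache(maxsize=128)
    def request_patch(tract_index, patch_index, skymap_name, bands_tuple):
        # Stand-in for an expensive fetch; no 'self', so the key is only these arguments.
        return {band: (tract_index, patch_index, skymap_name) for band in bands_tuple}


first = PatchFetcherSketch.request_patch(9813, 42, "example_skymap", ("g", "r", "i"))
second = PatchFetcherSketch().request_patch(9813, 42, "example_skymap", ("g", "r", "i"))
print(first is second)
print(PatchFetcherSketch.request_patch.cache_info())
```

Statistics and clearing are available on the wrapped function via cache_info() and cache_clear(), which is the standard functools interface.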
- _fetch_single_cutout(row, idx=None, manifest_idx=None)[source]#
Fetch cutout, using saved cutout if available, with optional band filtering.
- _fetch_cutout_with_cache(row)[source]#
Generate cutout using cached patch fetching with NaN filling for failed bands.
- _get_manifest_index_for_catalog_index(catalog_idx)[source]#
Map catalog index to manifest index. None return indicates no such item in manifest.
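The catalog-to-manifest mapping and its None-on-miss lookup can be sketched with two dictionaries. The function names and the use of plain ID lists are assumptions; the sketch relies only on object_id being a unique key, as stated for the manifest merge.

```python
def build_index_map(catalog_ids, manifest_ids):
    """Hypothetical helper: map catalog row positions to manifest row positions."""
    manifest_position = {object_id: i for i, object_id in enumerate(manifest_ids)}
    return {
        catalog_idx: manifest_position[object_id]
        for catalog_idx, object_id in enumerate(catalog_ids)
        if object_id in manifest_position
    }


def manifest_index_for(index_map, catalog_idx):
    """Return the manifest index, or None when the catalog row has no manifest entry."""
    return index_map.get(catalog_idx)


index_map = build_index_map(catalog_ids=[30, 10, 99], manifest_ids=[10, 20, 30])
print(index_map)
print(manifest_index_for(index_map, 2))
```

Building the dictionary once makes each later lookup constant time instead of a scan over the manifest.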
- get_image(idxs)[source]#
Fetch image cutout(s) for given index or indices, using caching and band filtering.
- Parameters:
idxs (int or slice or list) – Index or indices to fetch.
- Returns:
Single cutout tensor or list of cutout tensors.
- Return type:
torch.Tensor or list of torch.Tensor
- __getitem__(idxs) → dict[source]#
Modified to pass index for saving cutouts.
- Parameters:
idxs (int or slice or list) – Index or indices to fetch.
- Returns:
Dictionary with key 'data' containing another dict of default data fields to return. Currently only 'image' is supported.
- Return type:
dict
- download_cutouts(indices=None, sync_filesystem=True, max_workers=None, force_retry=False)[source]#
Download cutouts using multiple threads with caching.
- Parameters:
indices – List of indices to download, or None for all
sync_filesystem – Whether to sync manifest with existing files on disk
max_workers – Maximum number of worker threads, or None to use default
force_retry – Whether to retry previously failed downloads
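The resume and retry behavior described by these parameters can be sketched with a thread pool over a list of manifest rows. This is an illustration only: download_pending and the fetch callback are hypothetical, an empty filename is assumed to mean "not yet tried", and "Attempted" marks a failed download as described for _update_manifest_entry.

```python
from concurrent.futures import ThreadPoolExecutor


def download_pending(manifest, fetch, max_workers=1, force_retry=False):
    """Hypothetical resumable download loop; returns the number of rows attempted."""

    def needs_download(row):
        if row["filename"] == "Attempted":
            return force_retry  # failed rows are retried only on request
        return not row["filename"]  # assumption: empty filename means pending

    pending = [i for i, row in enumerate(manifest) if needs_download(row)]

    def work(i):
        # Each task writes to its own row, so this sketch needs no lock.
        try:
            manifest[i]["filename"] = fetch(i)
        except Exception:
            manifest[i]["filename"] = "Attempted"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(work, pending))
    return len(pending)


def fake_fetch(i):
    # Stand-in for fetching and saving a cutout; returns the saved file name.
    return f"cutout_{i:04d}.pt"


rows = [{"filename": ""}, {"filename": "cutout_0001.pt"}, {"filename": "Attempted"}]
print(download_pending(rows, fake_fetch))
print(download_pending(rows, fake_fetch, max_workers=2, force_retry=True))
```

Rows that already have a file name are skipped on every call, which is what lets a second run pick up where the first one stopped.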