hyrax.data_sets.downloaded_lsst_dataset
Attributes
Classes
DownloadedLSSTDataset: A dataset that inherits from LSSTDataset and downloads cutouts from the LSST butler.
Module Contents
- class DownloadedLSSTDataset(config, data_location)
Bases: hyrax.data_sets.lsst_dataset.LSSTDataset, hyrax.data_sets.tensor_cache_mixin.TensorCacheMixin
DownloadedLSSTDataset: A dataset that inherits from LSSTDataset and downloads cutouts from the LSST butler, saving them as .pt files on first access. On subsequent accesses, it loads cutouts directly from these cached files.
This class also creates a manifest file that records the shape of each cutout and the corresponding filename.
- Public Methods:
- download_cutouts(indices=None, sync_filesystem=True, max_workers=None, force_retry=False):
Download cutouts with parallel processing. Automatically resumes from previous progress. Use max_workers to control thread count, force_retry to re-attempt failed downloads.
- manifest_stats():
Returns dict with download statistics: total, successful, failed, pending counts and manifest file path.
- download_progress():
Returns detailed progress metrics including completion percentage and failure rates.
- reset_failed_downloads():
Resets all failed download attempts to allow retry without force_retry flag. Returns count of reset entries.
- save_manifest_now():
Forces immediate manifest save (normally saved periodically during downloads).
- cache_info():
Returns LRU cache statistics for patch fetching performance monitoring.
- clear_cache():
Clears the patch LRU cache to free memory.
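As a rough illustration of the kind of summary manifest_stats() describes, the sketch below counts manifest entries by status. It is not the Hyrax implementation: the function name summarize_manifest is hypothetical, and it assumes that an empty filename means "pending" and that the string "Attempted" marks a failed download.

```python
from collections import Counter


def summarize_manifest(filenames, manifest_path):
    """Count manifest entries by status (hypothetical helper, not Hyrax API)."""
    counts = Counter()
    for name in filenames:
        if not name:
            counts["pending"] += 1      # assumption: empty filename = not yet tried
        elif name == "Attempted":
            counts["failed"] += 1       # "Attempted" marks a failed download
        else:
            counts["successful"] += 1   # any other value is a saved file name
    return {
        "total": len(filenames),
        "successful": counts["successful"],
        "failed": counts["failed"],
        "pending": counts["pending"],
        "manifest_path": manifest_path,
    }


stats = summarize_manifest(
    ["cutout_0000.pt", "Attempted", "", "cutout_0003.pt"], "manifest.fits"
)
```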
- Usage Example:
    # Initialize Hyrax
    import hyrax
    h = hyrax.Hyrax()
    a = h.prepare()

    # Download all cutouts (resumes automatically)
    a.download_cutouts(max_workers=4)

    # Check progress
    a.download_progress()

    # Retry failed downloads
    a.download_cutouts(force_retry=True)

    # Access cutouts (loads from cache)
    cutout = a[0]      # Single cutout
    cutouts = a[0:10]  # Multiple cutouts

WARNING: The LRU caching scheme is slightly complicated, so it is recommended to use the default max_workers=1 for the first download. Simply using more workers may not always speed up the download process.
File Organization:
- Cutouts saved as: cutout_{object_id}.pt or cutout_{index:04d}.pt
- Manifest saved as: manifest.fits (Astropy) or manifest.parquet (HATS)
- All files stored in config["general"]["data_dir"]
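The two cutout naming patterns can be sketched with a small helper. The function cutout_filename is hypothetical, and the rule used here (prefer the object ID when one is available, otherwise fall back to the zero-padded index) is an assumption; the documentation only states that both patterns exist.

```python
def cutout_filename(index, object_id=None):
    """Build a cutout file name (hypothetical helper, not Hyrax API)."""
    if object_id is not None:
        # Pattern: cutout_{object_id}.pt
        return f"cutout_{object_id}.pt"
    # Pattern: cutout_{index:04d}.pt (index zero-padded to four digits)
    return f"cutout_{index:04d}.pt"


by_index = cutout_filename(7)
by_object = cutout_filename(7, object_id=3141592653)
```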
Initialize the dataset with either a HATS catalog or astropy table.
Config can specify either:
- config["data_set"]["hats_catalog"]: path to HATS catalog
- config["data_set"]["astropy_table"]: path to any file readable by Astropy Table
- ids(log_every=None)
Generator yielding object IDs for the entire dataset. Required by TensorCacheMixin.
- _initialize_manifest()
Create new manifest or load/merge with existing manifest, with band filtering validation.
- _merge_manifests(existing_manifest)
Merge existing manifest with current catalog based on object_id.
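A merge keyed on object_id might look like the sketch below, which uses plain dictionaries in place of Astropy or HATS tables. The function merge_manifest_rows and the row layout are hypothetical. It assumes that objects new to the catalog get empty entries and that manifest rows for objects no longer in the catalog are dropped; the real method may handle either case differently.

```python
def merge_manifest_rows(existing_rows, catalog_ids):
    """Carry over manifest rows for known object IDs; add blanks for new ones."""
    existing_by_id = {row["object_id"]: row for row in existing_rows}
    merged = []
    for object_id in catalog_ids:
        if object_id in existing_by_id:
            # Keep download state recorded in the existing manifest
            merged.append(dict(existing_by_id[object_id]))
        else:
            # Assumption: a new object starts with no shape and no file
            merged.append(
                {"object_id": object_id, "cutout_shape": None, "filename": ""}
            )
    return merged


existing = [
    {"object_id": 10, "cutout_shape": (3, 64, 64), "filename": "cutout_10.pt"},
    {"object_id": 20, "cutout_shape": None, "filename": "Attempted"},
]
merged = merge_manifest_rows(existing, catalog_ids=[20, 30, 10])
```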
- _build_catalog_to_manifest_index_map(manifest)
Build efficient mapping from catalog indices to manifest indices.
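One way to build such a mapping in a single pass is to index the manifest by object ID first and then walk the catalog. This is only a sketch: build_index_map is a hypothetical name, and matching rows by object ID is an assumption.

```python
def build_index_map(catalog_ids, manifest_ids):
    """Map each catalog row index to the manifest row with the same object ID."""
    # One dictionary lookup per catalog row instead of a linear search
    manifest_position = {oid: i for i, oid in enumerate(manifest_ids)}
    return {
        catalog_idx: manifest_position[oid]
        for catalog_idx, oid in enumerate(catalog_ids)
        if oid in manifest_position
    }


index_map = build_index_map(catalog_ids=[30, 10, 99], manifest_ids=[10, 20, 30])
```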
- _add_manifest_columns()
Add cutout_shape, filename, and downloaded_bands columns to manifest.
- _get_available_bands_from_manifest(manifest)
Get available bands by checking first 10 successful downloads for consistency.
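The consistency check can be pictured as sampling the first few successful rows and comparing their band lists. In this sketch the function available_bands and the row layout are hypothetical, and raising ValueError on a mismatch is an assumption; the documentation does not say how the real method reports inconsistency.

```python
def available_bands(rows, sample_size=10):
    """Return the band list shared by the first successful downloads."""
    successful = [
        tuple(row["downloaded_bands"])
        for row in rows
        if row["filename"] not in ("", "Attempted")  # skip pending and failed rows
    ]
    sample = successful[:sample_size]
    if not sample:
        return None  # nothing downloaded yet, so no bands are known
    if any(bands != sample[0] for bands in sample):
        raise ValueError("inconsistent bands among sampled downloads")
    return list(sample[0])


rows = [
    {"filename": "cutout_0000.pt", "downloaded_bands": ["g", "r", "i"]},
    {"filename": "Attempted", "downloaded_bands": []},
    {"filename": "cutout_0002.pt", "downloaded_bands": ["g", "r", "i"]},
]
bands = available_bands(rows)
```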
- _setup_band_filtering(requested_bands, original_band_order)
Setup band filtering to extract only requested bands from cached cutouts.
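Band filtering of this kind usually reduces to computing, once, the positions of the requested bands within the stored band order, and then indexing each cached cutout with those positions. The sketch below uses nested lists as a stand-in for a (band, height, width) tensor; band_indices is a hypothetical helper, and rejecting unknown bands with ValueError is an assumption.

```python
def band_indices(requested_bands, original_band_order):
    """Positions of the requested bands within the stored band order."""
    missing = [b for b in requested_bands if b not in original_band_order]
    if missing:
        raise ValueError(f"bands not present in cached cutouts: {missing}")
    return [original_band_order.index(b) for b in requested_bands]


# Nested lists stand in for a tensor with one 1x1 plane per band (g, r, i)
cached_cutout = [[[1.0]], [[2.0]], [[3.0]]]
indices = band_indices(["i", "g"], ["g", "r", "i"])
filtered = [cached_cutout[k] for k in indices]
```

With a real tensor the last step would typically be a single indexing operation along the band axis.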
- _update_manifest_entry(idx, cutout_shape=None, filename='Attempted', downloaded_bands=None)
Thread-safe manifest update with periodic saves.
- Parameters:
idx – Index in the manifest
cutout_shape – Shape tuple of the cutout tensor, or None for failed downloads
filename – Basename of the saved file, or “Attempted” only when ALL bands fail
downloaded_bands – List of band names successfully downloaded in tensor order
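A common pattern for a thread-safe update with periodic saves is to guard both the row update and the save counter with one lock. The class below is a minimal sketch of that pattern, not the Hyrax code: ManifestTracker, save_every, and the in-memory row layout are all hypothetical, and the save itself is reduced to a counter.

```python
import threading


class ManifestTracker:
    """Toy manifest with locked updates and a save every N updates."""

    def __init__(self, size, save_every=3):
        self._lock = threading.Lock()
        self.rows = [{"cutout_shape": None, "filename": ""} for _ in range(size)]
        self._save_every = save_every
        self._updates_since_save = 0
        self.save_count = 0

    def _save(self):
        # Placeholder for writing the manifest to disk
        self.save_count += 1
        self._updates_since_save = 0

    def update(self, idx, cutout_shape=None, filename="Attempted"):
        # The lock covers the row write and the save decision together
        with self._lock:
            self.rows[idx] = {"cutout_shape": cutout_shape, "filename": filename}
            self._updates_since_save += 1
            if self._updates_since_save >= self._save_every:
                self._save()


tracker = ManifestTracker(size=6, save_every=3)
threads = [
    threading.Thread(
        target=tracker.update, args=(i, (3, 64, 64), f"cutout_{i:04d}.pt")
    )
    for i in range(6)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```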
- static _request_patch_cached(tract_index, patch_index, butler_repo, butler_collections, skymap_name, bands_tuple)
Cached patch fetching using static method.
Static method means no ‘self’ in cache key, making it truly global. Thread-safe because each call creates its own Butler instance.
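The effect of caching a static method can be shown with functools.lru_cache from the standard library. Because the cached function takes no self argument, every instance shares the same cache entries. The class PatchFetcher and its return value are hypothetical stand-ins, and whether Hyrax uses lru_cache specifically is an assumption; note that the bands are passed as a tuple because lru_cache requires hashable arguments.

```python
from functools import lru_cache


class PatchFetcher:
    """Toy class whose patch lookup is cached across all instances."""

    @staticmethod
    @lru_cache(maxsize=128)
    def fetch_patch(tract_index, patch_index, bands_tuple):
        # Stand-in for an expensive butler request
        return {"tract": tract_index, "patch": patch_index, "bands": bands_tuple}


first_fetcher = PatchFetcher()
second_fetcher = PatchFetcher()

first = first_fetcher.fetch_patch(1, 2, ("g", "r"))
second = second_fetcher.fetch_patch(1, 2, ("g", "r"))  # served from the cache

info = PatchFetcher.fetch_patch.cache_info()
```

Here the second call is a cache hit even though it goes through a different instance, which is the "truly global" behaviour the docstring refers to.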
- _fetch_single_cutout(row, idx=None, manifest_idx=None)
Fetch cutout, using saved cutout if available, with optional band filtering.
- _fetch_cutout_with_cache(row)
Generate cutout using cached patch fetching with NaN filling for failed bands.
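NaN filling keeps the band axis complete even when some bands fail, so every cutout has the same number of planes in the same order. The sketch below illustrates the idea with nested lists instead of tensors; assemble_cutout and its arguments are hypothetical, and representing a failed band as None is an assumption.

```python
import math


def assemble_cutout(band_images, bands, shape):
    """Stack per-band images in order, using NaN planes for failed bands."""
    height, width = shape
    planes = []
    for band in bands:
        image = band_images.get(band)
        if image is None:
            # Failed band: substitute a plane of NaNs with the expected shape
            image = [[math.nan] * width for _ in range(height)]
        planes.append(image)
    return planes


band_images = {"g": [[1.0, 2.0]], "r": None}
cutout = assemble_cutout(band_images, bands=["g", "r"], shape=(1, 2))
```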
- _get_manifest_index_for_catalog_index(catalog_idx)
Map catalog index to manifest index when filtering is active.
- get_image(idxs)
Fetch image cutout(s) for given index or indices, using caching and band filtering.
Parameters:
- idxs: int or slice or list
Index or indices to fetch.
Returns:
- torch.Tensor or list of torch.Tensor:
Single cutout tensor or list of cutout tensors.
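Accepting an int, a slice, or a list typically means normalizing the argument to a list of indices and remembering whether a single item was requested. The helper below is a hypothetical sketch of that step, not the logic used by get_image itself.

```python
def normalize_indices(idxs, length):
    """Return (list_of_indices, is_single) for an int, slice, or list."""
    if isinstance(idxs, slice):
        # slice.indices clamps start/stop/step to the dataset length
        return list(range(*idxs.indices(length))), False
    if isinstance(idxs, int):
        return [idxs], True
    return list(idxs), False


single = normalize_indices(3, length=10)
sliced = normalize_indices(slice(0, 3), length=10)
listed = normalize_indices([4, 1], length=10)
```

A caller can then fetch one cutout per index and return either the first element or the whole list, depending on the is_single flag.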
- __getitem__(idxs) → dict
Modified to pass index for saving cutouts.
Parameters:
- idxs: int or slice or list
Index or indices to fetch.
Returns:
- dict:
Dictionary with key ‘data’ containing another dict of default data fields to return. Currently only ‘image’ is supported.
- download_cutouts(indices=None, sync_filesystem=True, max_workers=None, force_retry=False)
Download cutouts using multiple threads with caching.
- Parameters:
indices – List of indices to download, or None for all
sync_filesystem – Whether to sync manifest with existing files on disk
max_workers – Maximum number of worker threads, or None to use default
force_retry – Whether to retry previously failed downloads
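The interplay of indices, max_workers, and force_retry can be sketched with concurrent.futures.ThreadPoolExecutor. This is an illustration under stated assumptions rather than the Hyrax implementation: download_all, fake_fetch, and the status strings "pending", "failed", and "done" are hypothetical, and sync_filesystem is left out.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(status, fetch, indices=None, max_workers=None, force_retry=False):
    """Fetch pending items in threads; retry failed ones only if asked."""
    if indices is None:
        indices = range(len(status))
    todo = [
        i for i in indices
        if status[i] == "pending" or (force_retry and status[i] == "failed")
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, i): i for i in todo}
        for future in as_completed(futures):
            i = futures[future]
            try:
                future.result()
                status[i] = "done"
            except Exception:
                status[i] = "failed"  # recorded so a later run can skip or retry it
    return len(todo)


def fake_fetch(i):
    """Stand-in for a butler request; index 2 always fails."""
    if i == 2:
        raise RuntimeError("simulated download failure")
    return i


status = ["done", "pending", "pending", "failed"]
attempted = download_all(status, fake_fetch, max_workers=2)
after_first_run = list(status)
retried = download_all(status, fake_fetch, max_workers=2, force_retry=True)
```

Items already marked done are never resubmitted, which is what makes a rerun resume from previous progress in this sketch.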