hyrax.datasets.downloaded_lsst_dataset


Attributes#

Classes#

DownloadedLSSTDataset

DownloadedLSSTDataset: A dataset that inherits from LSSTDataset and downloads cutouts from the LSST butler.

Module Contents#

logger[source]#
class DownloadedLSSTDataset(config, data_location)[source]#

Bases: hyrax.datasets.lsst_dataset.LSSTDataset

DownloadedLSSTDataset: A dataset that inherits from LSSTDataset and downloads cutouts from the LSST butler, saving them as .pt files during first access. On subsequent accesses, it loads cutouts directly from these cached files.

This class also creates a manifest file that records the shape of each cutout and the corresponding filename.

Public Methods:
download_cutouts(indices=None, sync_filesystem=True, max_workers=None, force_retry=False):

Download cutouts with parallel processing. Automatically resumes from previous progress. Use max_workers to control thread count, force_retry to re-attempt failed downloads.

manifest_stats():

Returns dict with download statistics: total, successful, failed, pending counts and manifest file path.

download_progress():

Returns detailed progress metrics including completion percentage and failure rates.

reset_failed_downloads():

Resets all failed download attempts to allow retry without force_retry flag. Returns count of reset entries.

save_manifest_now():

Forces immediate manifest save (normally saved periodically during downloads).

cache_info():

Returns LRU cache statistics for patch fetching performance monitoring.

clear_cache():

Clears the patch LRU cache to free memory.

Usage Example:

# Initialize Hyrax
h = hyrax.Hyrax()
a = h.prepare()

# Download all cutouts (resumes automatically)
a.download_cutouts(max_workers=4)

WARNING: The LRU caching scheme is slightly complicated, so it is recommended to use the default max_workers=1 for the first download. Simply using more workers may not always speed up the download process.

# Check progress
a.download_progress()

# Retry failed downloads
a.download_cutouts(force_retry=True)

# Access cutouts (loads from cache)
cutout = a[0]  # Single cutout
cutouts = a[0:10]  # Multiple cutouts

File Organization:
- Cutouts saved as: cutout_{object_id}.pt or cutout_{index:04d}.pt
- Manifest saved as: manifest.fits (Astropy) or manifest.parquet (HATS)
- All files stored in the data_location provided during initialization
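The two cutout naming patterns can be illustrated with a short sketch. The helper name `cutout_filename` is hypothetical and not part of the Hyrax API, and the rule for choosing between the patterns (object ID when one is available, otherwise the index) is an assumption made for illustration.

```python
from pathlib import Path
from typing import Optional


def cutout_filename(index: int, object_id: Optional[str] = None) -> str:
    """Hypothetical helper: build a cutout file name from the documented patterns."""
    if object_id is not None:
        # Pattern based on the object ID (assumed to apply when an ID exists).
        return f"cutout_{object_id}.pt"
    # Index-based pattern: zero-padded to at least four digits.
    return f"cutout_{index:04d}.pt"


data_location = Path("/tmp/hyrax_cutouts")  # placeholder directory
print(data_location / cutout_filename(7))
print(data_location / cutout_filename(7, object_id="12345"))
```

Note that `{index:04d}` pads to a minimum width only, so indices of five or more digits produce longer names rather than being truncated.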

__init__()[source]#

Initialize the dataset with either a HATS catalog or astropy table.

Config can specify either:
- config["data_set"]["hats_catalog"]: path to HATS catalog
- config["data_set"]["astropy_table"]: path to any file readable by Astropy Table
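As a rough illustration, the two configuration options might look like the dictionaries below. Only the key names come from the description above; the paths are placeholders, and `catalog_source` is a hypothetical helper whose precedence (HATS first) is an assumption, not documented behavior.

```python
# Placeholder paths; only the key names are taken from the documentation.
config_with_hats = {"data_set": {"hats_catalog": "/path/to/hats_catalog"}}
config_with_table = {"data_set": {"astropy_table": "/path/to/catalog.fits"}}


def catalog_source(config: dict) -> str:
    """Hypothetical helper: report which catalog option a config provides."""
    data_set = config.get("data_set", {})
    if data_set.get("hats_catalog"):
        return "hats_catalog"
    if data_set.get("astropy_table"):
        return "astropy_table"
    raise ValueError("config['data_set'] needs 'hats_catalog' or 'astropy_table'")


print(catalog_source(config_with_hats))
print(catalog_source(config_with_table))
```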

download_dir[source]#
catalog_object_ids[source]#
_manifest_lock[source]#
_updates_since_save = 0[source]#
_save_interval = 1000[source]#
_band_failure_stats[source]#
_band_failure_lock[source]#
_manifest_filter_object_ids = None[source]#
_catalog_to_manifest_index_map = None[source]#
_manifest_to_catalog_index_map = None[source]#
get_objectId(idx)[source]#

Get object ID for a given index based on naming strategy.

_setup_naming_strategy()[source]#

Setup file naming strategy based on catalog columns.

_initialize_manifest()[source]#

Create new manifest or load/merge with existing manifest, with band filtering validation.

The manifest is always an astropy Table with at least the following columns:
- cutout_shape: np.array of dimensions, e.g. [3, 150, 150]
- filename: string containing the name of the file that holds the tensor for the object
- downloaded_bands: string containing a comma-separated list of the bands downloaded. Order is expected to be consistent between rows.

When this astropy table is loaded into memory, multiple sources are consulted:
- The manifest on the filesystem, which is the source of truth for which files have been downloaded. If it is not found, it is created.
- The bands given in the catalog passed in.

_load_existing_manifest()[source]#

Load existing manifest file.

_update_manifest_from_catalog(existing_manifest)[source]#

Using object_id as a unique key, adds manifest entries to existing_manifest, using self.catalog as the source of any new objects.

self.catalog is not altered by this operation.

Entries in existing_manifest are not altered by this operation. New entries are added to the end of existing_manifest with a state indicating they have not been downloaded.
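The merge described here can be sketched with plain Python lists of dicts standing in for the astropy tables. This is a simplified illustration, not the actual implementation: the function returns a new list rather than modifying a table, and the representation of the "not downloaded" state (empty filename, no shape) is an assumption.

```python
def update_manifest_from_catalog(existing_manifest, catalog):
    """Illustrative merge keyed on object_id; rows are plain dicts here."""
    known_ids = {row["object_id"] for row in existing_manifest}
    merged = list(existing_manifest)  # existing rows kept as-is, in order
    for row in catalog:
        if row["object_id"] in known_ids:
            continue
        # New object: append an entry marked as not yet downloaded (assumed encoding).
        merged.append(
            {
                "object_id": row["object_id"],
                "cutout_shape": None,
                "filename": "",
                "downloaded_bands": "",
            }
        )
        known_ids.add(row["object_id"])
    return merged


existing = [
    {"object_id": 1, "cutout_shape": [3, 150, 150],
     "filename": "cutout_1.pt", "downloaded_bands": "g,r,i"},
]
catalog = [{"object_id": 2}, {"object_id": 1}, {"object_id": 3}]
merged = update_manifest_from_catalog(existing, catalog)
print([row["object_id"] for row in merged])
```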

_build_catalog_to_manifest_index_map()[source]#

Build efficient mapping from catalog indices to manifest indices.
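One way such a mapping could be built is with a dictionary lookup on object IDs, which avoids a linear search per catalog row. The function name and the list-based inputs are illustrative assumptions.

```python
def build_catalog_to_manifest_index_map(catalog_ids, manifest_ids):
    """Illustrative: map each catalog index to the manifest index with the same object ID."""
    # One pass over the manifest gives constant-time lookups afterwards.
    manifest_position = {object_id: i for i, object_id in enumerate(manifest_ids)}
    return {
        catalog_idx: manifest_position[object_id]
        for catalog_idx, object_id in enumerate(catalog_ids)
        if object_id in manifest_position
    }


index_map = build_catalog_to_manifest_index_map(
    catalog_ids=["b", "c", "z"],
    manifest_ids=["a", "b", "c", "d"],
)
print(index_map)
print(index_map.get(2))  # no manifest entry for "z"
```

Using `dict.get` for lookups yields None for catalog entries that have no manifest row.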

_add_manifest_columns_to_table(table)[source]#

Add cutout_shape, filename, and downloaded_bands columns to manifest.

_longest_object_id_idx()[source]#
_get_available_bands_from_manifest(manifest)[source]#

Get available bands by finding entries with complete band coverage.

Uses cutout_shape[0] to determine the expected number of bands, then finds entries where downloaded_bands has that many entries (i.e., complete downloads).
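The completeness check can be sketched as follows, with dicts standing in for manifest rows. The function name and the handling of missing values are assumptions made for illustration.

```python
def available_bands(manifest_rows):
    """Illustrative: return the band list of the first row with complete band coverage."""
    for row in manifest_rows:
        shape = row.get("cutout_shape")
        bands_str = row.get("downloaded_bands")
        if not shape or not bands_str:
            continue  # not downloaded yet, or download failed
        bands = bands_str.split(",")
        # cutout_shape[0] is the expected band count; a full match means a complete download.
        if len(bands) == shape[0]:
            return bands
    return None


rows = [
    {"cutout_shape": None, "downloaded_bands": ""},
    {"cutout_shape": [3, 150, 150], "downloaded_bands": "g,r"},
    {"cutout_shape": [3, 150, 150], "downloaded_bands": "g,r,i"},
]
print(available_bands(rows))
```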

_setup_band_filtering(requested_bands, original_band_order)[source]#

Setup band filtering to extract only requested bands from cached cutouts.
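A minimal sketch of band filtering, assuming it reduces to selecting channel indices from the cached band order. The function name is hypothetical, and a nested list stands in for the cutout tensor.

```python
def band_indices(requested_bands, original_band_order):
    """Illustrative: positions of the requested bands within the cached band order."""
    missing = [band for band in requested_bands if band not in original_band_order]
    if missing:
        raise ValueError(f"Bands not present in cached cutouts: {missing}")
    return [original_band_order.index(band) for band in requested_bands]


# A nested list stands in for a (bands, height, width) tensor.
cached_cutout = [["g-plane"], ["r-plane"], ["i-plane"]]
indices = band_indices(["i", "g"], ["g", "r", "i"])
filtered_cutout = [cached_cutout[i] for i in indices]
print(indices, filtered_cutout)
```

With a real tensor, the same index list could be used to select along the first axis.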

_get_cutout_path_from_idx(idx)[source]#

Generate cutout file path for a given index.

This simply applies a pattern to the filename using the object_id column. No guarantees are made about the file itself.

_get_cutout_path_from_manifest(idx)[source]#

Get the cutout path by consulting the manifest.

The download thread ensures that the filename is not written to the manifest until all the bands that we intend to download are downloaded.

This function is intended to be a thread-safe way to get valid cutout paths. If the file exists and is believed to be correctly downloaded, you get a filename; if there is any other issue, this returns None.

Parameters:

idx (int) – The catalog index of the relevant cutout

Returns:

Path to the cutout, or None if no valid cutout is available.

Return type:

Path or None

_update_manifest_entry(idx, cutout_shape=None, filename='Attempted', downloaded_bands=None)[source]#

Thread-safe manifest update with periodic saves.

Parameters:
  • idx – Index in the manifest

  • cutout_shape – Shape tuple of the cutout tensor, or None for failed downloads

  • filename – Basename of the saved file, or “Attempted” only when ALL bands fail

  • downloaded_bands – List of band names successfully downloaded in tensor order
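The combination of a lock and a periodic save can be sketched as below. The class `ManifestSketch` is a stand-in written for illustration: rows are dicts, saving is simulated by a counter, and the small `save_interval` is chosen only to make the behavior visible.

```python
import threading


class ManifestSketch:
    """Illustrative stand-in for a manifest with lock-protected updates."""

    def __init__(self, n_rows, save_interval=1000):
        self.rows = [
            {"cutout_shape": None, "filename": "", "downloaded_bands": ""}
            for _ in range(n_rows)
        ]
        self._lock = threading.Lock()
        self._updates_since_save = 0
        self._save_interval = save_interval
        self.save_count = 0  # counts simulated writes to disk

    def _save(self):
        # A real implementation would write the table to disk here.
        self.save_count += 1
        self._updates_since_save = 0

    def update_entry(self, idx, cutout_shape=None, filename="Attempted",
                     downloaded_bands=None):
        with self._lock:  # only one thread mutates the manifest at a time
            row = self.rows[idx]
            row["cutout_shape"] = cutout_shape
            row["filename"] = filename
            row["downloaded_bands"] = ",".join(downloaded_bands or [])
            self._updates_since_save += 1
            if self._updates_since_save >= self._save_interval:
                self._save()


manifest = ManifestSketch(n_rows=10, save_interval=4)
threads = [
    threading.Thread(
        target=manifest.update_entry,
        args=(i, (3, 150, 150), f"cutout_{i:04d}.pt", ["g", "r", "i"]),
    )
    for i in range(10)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(manifest.save_count, manifest._updates_since_save)
```

Saving inside the lock keeps the counter and the written state consistent, at the cost of blocking other updates while a save is in progress.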

_save_manifest()[source]#

Save the manifest file.

_sync_manifest_with_filesystem()[source]#

Sync manifest with actual downloaded files on disk.

This updates the manifest to reflect what is on the filesystem. For existing cutouts, this loads every file using torch.load.

static _request_patch_cached(tract_index, patch_index, butler, skymap_name, bands_tuple)[source]#

Cached patch fetching using static method.

Because this is a static method, 'self' is not part of the cache key, so the cache is shared globally across all instances. Thread-safe because each call creates its own Butler instance.
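The general pattern of caching a static method can be sketched with `functools.lru_cache`. This is an illustration of the technique only: the class, the `maxsize` value, and the placeholder return value are assumptions, and the `butler` argument is omitted because the sketch does not talk to a real Butler.

```python
from functools import lru_cache


class PatchFetcherSketch:
    """Illustrative class with an LRU-cached static method."""

    @staticmethod
    @lru_cache(maxsize=128)  # maxsize chosen arbitrarily for this sketch
    def request_patch(tract_index, patch_index, skymap_name, bands_tuple):
        # All arguments must be hashable, hence a tuple of bands rather than a list.
        # A real implementation would fetch patch images here.
        return tuple(f"{skymap_name}:{tract_index}:{patch_index}:{band}"
                     for band in bands_tuple)


first = PatchFetcherSketch.request_patch(10, 3, "example_skymap", ("g", "r"))
second = PatchFetcherSketch.request_patch(10, 3, "example_skymap", ("g", "r"))
info = PatchFetcherSketch.request_patch.cache_info()
print(info.hits, info.misses)

PatchFetcherSketch.request_patch.cache_clear()
print(PatchFetcherSketch.request_patch.cache_info().currsize)
```

The cached function exposes `cache_info()` and `cache_clear()`, which is the kind of information that methods like cache_info() and clear_cache() on the dataset could surface.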

_fetch_single_cutout(row, idx=None, manifest_idx=None)[source]#

Fetch cutout, using saved cutout if available, with optional band filtering.

_fetch_cutout_with_cache(row)[source]#

Generate cutout using cached patch fetching with NaN filling for failed bands.

__len__()[source]#

Return length of current catalog, not the full manifest.

_get_manifest_index_for_catalog_index(catalog_idx)[source]#

Map catalog index to manifest index. None return indicates no such item in manifest.

get_image(idxs)[source]#

Fetch image cutout(s) for given index or indices, using caching and band filtering.

Parameters:

idxs (int or slice or list) – Index or indices to fetch.

Returns:

Single cutout tensor or list of cutout tensors.

Return type:

torch.Tensor or list of torch.Tensor

__getitem__(idxs) → dict[source]#

Overrides the parent implementation to pass the index along so that fetched cutouts can be saved.

Parameters:

idxs (int or slice or list) – Index or indices to fetch.

Returns:

Dictionary with key ‘data’ containing another dict of default data fields to return. Currently only ‘image’ is supported.

Return type:

dict

download_cutouts(indices=None, sync_filesystem=True, max_workers=None, force_retry=False)[source]#

Download cutouts using multiple threads with caching.

Parameters:
  • indices – List of indices to download, or None for all

  • sync_filesystem – Whether to sync manifest with existing files on disk

  • max_workers – Maximum number of worker threads, or None to use default

  • force_retry – Whether to retry previously failed downloads
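A threaded download loop of this kind can be sketched with `concurrent.futures.ThreadPoolExecutor`. The names `download_all` and `fake_download` are hypothetical, and the failure pattern is simulated; this is not the actual Hyrax implementation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(indices, download_one, max_workers=1):
    """Illustrative: run download_one(idx) in a thread pool and collect outcomes."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, idx): idx for idx in indices}
        for future in as_completed(futures):
            idx = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(idx)
            except Exception:
                failed.append(idx)  # recorded so a later retry can pick it up
    return sorted(succeeded), sorted(failed)


def fake_download(idx):
    """Simulated download that fails for some indices."""
    if idx % 5 == 4:
        raise RuntimeError(f"simulated failure for index {idx}")
    return idx


succeeded, failed = download_all(range(10), fake_download, max_workers=4)
print(succeeded, failed)
```

Collecting failures instead of raising lets one bad cutout be retried later without aborting the whole run.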

_download_single_cutout(catalog_idx, manifest_idx)[source]#

Helper method to download a single cutout.

cache_info()[source]#

Get cache statistics.

clear_cache()[source]#

Clear the LRU cache.

manifest_stats()[source]#

Get manifest statistics including downloaded bands information.

band_filtering_info()[source]#

Get information about current band filtering configuration.

save_manifest_now()[source]#

Force immediate manifest save.

static _determine_numprocs_download()[source]#

Determine number of threads for downloading.

reset_failed_downloads()[source]#

Reset failed download attempts to allow retry.

download_progress()[source]#

Get detailed download progress information.
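The kind of metrics involved can be sketched from the manifest's filename column. The key names in the returned dict and the use of an empty string for pending entries are assumptions; only the "Attempted" marker for failed downloads comes from the documentation of _update_manifest_entry.

```python
def progress_from_filenames(filenames):
    """Illustrative: derive progress metrics from manifest filename values."""
    total = len(filenames)
    failed = sum(1 for name in filenames if name == "Attempted")
    pending = sum(1 for name in filenames if name == "")  # assumed pending marker
    successful = total - failed - pending
    attempted = successful + failed
    return {
        "total": total,
        "successful": successful,
        "failed": failed,
        "pending": pending,
        "completion_percent": 100.0 * successful / total if total else 0.0,
        "failure_rate": failed / attempted if attempted else 0.0,
    }


stats = progress_from_filenames(["cutout_0000.pt", "Attempted", "", "cutout_0003.pt"])
print(stats)
```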

download_summary()[source]#

Get detailed download and band analysis, accounting for band filtering.