hyrax.data_sets.downloaded_lsst_dataset
=======================================

.. py:module:: hyrax.data_sets.downloaded_lsst_dataset


Attributes
----------

.. autoapisummary::

   hyrax.data_sets.downloaded_lsst_dataset.logger


Classes
-------

.. autoapisummary::

   hyrax.data_sets.downloaded_lsst_dataset.DownloadedLSSTDataset


Module Contents
---------------

.. py:data:: logger

.. py:class:: DownloadedLSSTDataset(config, data_location)

   Bases: :py:obj:`hyrax.data_sets.lsst_dataset.LSSTDataset`, :py:obj:`hyrax.data_sets.tensor_cache_mixin.TensorCacheMixin`


   DownloadedLSSTDataset: A dataset that inherits from LSSTDataset and downloads
   cutouts from the LSST butler, saving them as `.pt` files during first access.
   On subsequent accesses, it loads cutouts directly from these cached files.

   This class also creates a manifest file with the shape of each cutout and the
   corresponding filename.

   Public Methods:
       download_cutouts(indices=None, sync_filesystem=True, max_workers=None, force_retry=False):
           Download cutouts with parallel processing. Automatically resumes from
           previous progress. Use max_workers to control thread count, force_retry
           to re-attempt failed downloads.

       manifest_stats():
           Returns dict with download statistics: total, successful, failed, pending
           counts and manifest file path.

       download_progress():
           Returns detailed progress metrics including completion percentage and
           failure rates.

       reset_failed_downloads():
           Resets all failed download attempts to allow retry without force_retry flag.
           Returns count of reset entries.

       save_manifest_now():
           Forces immediate manifest save (normally saved periodically during downloads).

       cache_info():
           Returns LRU cache statistics for patch fetching performance monitoring.

       clear_cache():
           Clears the patch LRU cache to free memory.

   Usage Example:
       # Initialize Hyrax
       h = hyrax.Hyrax()
       a = h.prepare()

       # Download all cutouts (resumes automatically).
       # WARNING: The LRU caching scheme is somewhat intricate, so the default
       # max_workers=1 is recommended for the first download. Adding more
       # workers does not always speed up the download.
       a.download_cutouts(max_workers=4)

       # Check progress
       a.download_progress()

       # Retry failed downloads
       a.download_cutouts(force_retry=True)

       # Access cutouts (loads from cache)
       cutout = a[0]  # Single cutout
       cutouts = a[0:10]  # Multiple cutouts

   File Organization:

   - Cutouts saved as: cutout_{object_id}.pt or cutout_{index:04d}.pt
   - Manifest saved as: manifest.fits (Astropy) or manifest.parquet (HATS)
   - All files stored in config["general"]["data_dir"]

   .. py:method:: __init__

      Initialize the dataset with either a HATS catalog or astropy table.

      Config can specify either:

      - config["data_set"]["hats_catalog"]: path to HATS catalog
      - config["data_set"]["astropy_table"]: path to any file readable by Astropy Table



   .. py:attribute:: download_dir


   .. py:attribute:: catalog_object_ids


   .. py:attribute:: _manifest_lock


   .. py:attribute:: _updates_since_save
      :value: 0



   .. py:attribute:: _save_interval
      :value: 1000



   .. py:attribute:: _band_failure_stats


   .. py:attribute:: _band_failure_lock


   .. py:attribute:: _manifest_filter_object_ids
      :value: None



   .. py:attribute:: _catalog_to_manifest_index_map
      :value: None



   .. py:attribute:: _manifest_to_catalog_index_map
      :value: None



   .. py:method:: get_objectId(idx)

      Get object ID for a given index based on naming strategy.



   .. py:method:: ids(log_every=None)

      Generator yielding object IDs for the entire dataset. Required by TensorCacheMixin.



   .. py:method:: _setup_naming_strategy()

      Setup file naming strategy based on catalog columns.



   .. py:method:: _initialize_manifest()

      Create new manifest or load/merge with existing manifest, with band filtering validation.

      The manifest is always an astropy Table with at least the following columns:

      - cutout_shape: np.array of dimensions, e.g. [3, 150, 150]
      - filename: string containing the name of the file that holds the tensor for the object
      - downloaded_bands: string containing a comma-separated list of the bands downloaded.
        Order is expected to be consistent between rows.

      When this astropy table is loaded into memory, multiple sources are consulted:

      - The manifest on the filesystem, which is the source of truth for which
        files have been downloaded. If it is not found, it is created.
      - The bands given in the catalog passed in.




   .. py:method:: _load_existing_manifest()

      Load existing manifest file.



   .. py:method:: _update_manifest_from_catalog(existing_manifest)

      Using object_id as a unique key, adds manifest entries to existing_manifest,
      using self.catalog as the source of any new objects.

      self.catalog is not altered by this operation.

      Entries in existing_manifest are not altered by this operation.
      New entries are added to the end of existing_manifest with a state indicating
      they have not been downloaded.
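
      A minimal sketch of this merge, using plain dictionaries in place of astropy
      rows. The function name, the row layout, and the empty-string "not downloaded"
      marker are assumptions made for illustration, not the actual implementation:

      ```python
      def update_manifest_from_catalog(existing_manifest, catalog_ids):
          """Append catalog objects missing from the manifest, keyed on object_id."""
          known = {row["object_id"] for row in existing_manifest}
          for object_id in catalog_ids:
              if object_id not in known:
                  # Assumed placeholder state meaning "not downloaded yet".
                  existing_manifest.append(
                      {"object_id": object_id, "filename": "", "downloaded_bands": ""}
                  )
                  known.add(object_id)
          return existing_manifest


      manifest = [{"object_id": "a", "filename": "cutout_a.pt", "downloaded_bands": "g,r,i"}]
      catalog_ids = ["b", "a", "c"]
      update_manifest_from_catalog(manifest, catalog_ids)
      # Existing row "a" is untouched; "b" and "c" are appended in catalog order.
      print([row["object_id"] for row in manifest])
      ```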



   .. py:method:: _build_catalog_to_manifest_index_map()

      Build efficient mapping from catalog indices to manifest indices.
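
      One way such a mapping could be built is a single dictionary pass keyed on
      object ID. This is a sketch under that assumption; the function and argument
      names are hypothetical:

      ```python
      def build_catalog_to_manifest_index_map(catalog_ids, manifest_ids):
          # One pass over the manifest gives O(1) lookups for each catalog row.
          manifest_position = {oid: i for i, oid in enumerate(manifest_ids)}
          return {
              cat_idx: manifest_position[oid]
              for cat_idx, oid in enumerate(catalog_ids)
              if oid in manifest_position
          }


      index_map = build_catalog_to_manifest_index_map(["c", "x", "a"], ["a", "b", "c"])
      # Catalog row 1 ("x") has no manifest entry, so a lookup yields None.
      missing = index_map.get(1)
      print(index_map, missing)
      ```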



   .. py:method:: _add_manifest_columns_to_table(table)

      Add cutout_shape, filename, and downloaded_bands columns to manifest.



   .. py:method:: _longest_object_id_idx()


   .. py:method:: _get_available_bands_from_manifest(manifest)

      Best effort to get available bands by looking at first 10 successful downloads for consistency.



   .. py:method:: _setup_band_filtering(requested_bands, original_band_order)

      Setup band filtering to extract only requested bands from cached cutouts.
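
      Conceptually, band filtering reduces to computing which planes of a cached
      cutout to keep. The sketch below assumes a cutout is a sequence of per-band
      planes in ``original_band_order``; the helper name and the error handling are
      illustrative only:

      ```python
      def band_indices(requested_bands, original_band_order):
          """Return the plane indices that select the requested bands, in order."""
          missing = [b for b in requested_bands if b not in original_band_order]
          if missing:
              raise ValueError(f"Bands not present in cached cutouts: {missing}")
          return [original_band_order.index(b) for b in requested_bands]


      cached_cutout = ["plane_g", "plane_r", "plane_i"]  # stand-ins for image planes
      indices = band_indices(["i", "g"], ["g", "r", "i"])
      filtered = [cached_cutout[i] for i in indices]
      print(indices, filtered)
      ```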



   .. py:method:: _get_cutout_path_from_idx(idx)

      Generate cutout file path for a given index.

      This simply applies a pattern to the filename using the object_id column.
      No guarantees are made about the file itself.
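
      The two filename patterns listed under File Organization can be sketched as
      follows. The helper name and the example directory are hypothetical; only the
      patterns themselves come from the class documentation:

      ```python
      from pathlib import Path
      from typing import Optional


      def cutout_filename(index: int, object_id: Optional[str] = None) -> str:
          # Prefer the object ID when the catalog provides one, else a padded index.
          if object_id is not None:
              return f"cutout_{object_id}.pt"
          return f"cutout_{index:04d}.pt"


      download_dir = Path("data")  # stand-in for config["general"]["data_dir"]
      by_index = download_dir / cutout_filename(7)
      by_object_id = download_dir / cutout_filename(7, object_id="1234567890")
      print(by_index, by_object_id)
      ```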




   .. py:method:: _get_cutout_path_from_manifest(idx)

      Get the cutout path by consulting the manifest.

      The download thread ensures that the filename is not written to the manifest
      until all the bands that we intend to download are downloaded.

      This function is intended to be a thread-safe way to get valid cutout paths.
      If the file exists and is believed to be correctly downloaded, it returns
      the path; if there is any other issue, it returns None.

      :param idx: The catalog index of the relevant cutout
      :type idx: int

      :returns: Path to the cutout, or None if no valid cutout is available.
      :rtype: Path or None



   .. py:method:: _update_manifest_entry(idx, cutout_shape=None, filename='Attempted', downloaded_bands=None)

      Thread-safe manifest update with periodic saves.

      :param idx: Index in the manifest
      :param cutout_shape: Shape tuple of the cutout tensor, or None for failed downloads
      :param filename: Basename of the saved file, or "Attempted" only when ALL bands fail
      :param downloaded_bands: List of band names successfully downloaded in tensor order
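
      The lock-plus-counter pattern suggested by ``_manifest_lock``,
      ``_updates_since_save`` and ``_save_interval`` might look like the sketch
      below. The class, its row layout, and the no-op save are assumptions for
      illustration, not the real manifest code:

      ```python
      import threading
      from concurrent.futures import ThreadPoolExecutor


      class ManifestSketch:
          def __init__(self, n_rows, save_interval=1000):
              self.rows = [{"filename": "", "cutout_shape": None} for _ in range(n_rows)]
              self._lock = threading.Lock()
              self._updates_since_save = 0
              self._save_interval = save_interval
              self.save_count = 0

          def _save(self):
              # Stand-in for writing the manifest file.
              self.save_count += 1

          def update_entry(self, idx, cutout_shape=None, filename="Attempted"):
              # The lock serializes row edits and the save counter across threads.
              with self._lock:
                  self.rows[idx]["cutout_shape"] = cutout_shape
                  self.rows[idx]["filename"] = filename
                  self._updates_since_save += 1
                  if self._updates_since_save >= self._save_interval:
                      self._save()
                      self._updates_since_save = 0


      sketch = ManifestSketch(n_rows=10, save_interval=4)
      with ThreadPoolExecutor(max_workers=4) as pool:
          for i in range(10):
              pool.submit(sketch.update_entry, i, (3, 150, 150), f"cutout_{i:04d}.pt")
      print(sketch.save_count)
      ```

      With ten updates and a save interval of four, the sketch saves after the
      fourth and eighth updates regardless of thread scheduling.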



   .. py:method:: _save_manifest()

      Save the manifest to disk.



   .. py:method:: _sync_manifest_with_filesystem()

      Sync manifest with actual downloaded files on disk.

      This updates the manifest to reflect what is on the filesystem.
      For existing cutouts, this loads every file using `torch.load`.




   .. py:method:: _request_patch_cached(tract_index, patch_index, butler, skymap_name, bands_tuple)
      :staticmethod:


      Cached patch fetching using static method.

      Static method means no 'self' in cache key, making it truly global.
      Thread-safe because each call creates its own Butler instance.
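
      The effect of caching on a static method can be sketched with
      ``functools.lru_cache``. The class below is hypothetical and omits the
      ``butler`` argument; it only shows that the cache is shared by all instances
      and that bands must be passed as a hashable tuple:

      ```python
      from functools import lru_cache


      class PatchFetcher:
          fetch_calls = 0  # counts real (uncached) fetches

          @staticmethod
          @lru_cache(maxsize=128)
          def request_patch(tract_index, patch_index, skymap_name, bands_tuple):
              # Runs only on a cache miss; the key is built from the arguments alone.
              PatchFetcher.fetch_calls += 1
              return {band: (skymap_name, tract_index, patch_index, band) for band in bands_tuple}


      first = PatchFetcher().request_patch(1, 2, "skymap", ("g", "r"))
      second = PatchFetcher().request_patch(1, 2, "skymap", ("g", "r"))
      info = PatchFetcher.request_patch.cache_info()
      print(info.hits, info.misses)
      ```

      Calling ``PatchFetcher.request_patch.cache_clear()`` would empty this shared
      cache, which is the kind of operation ``clear_cache()`` is documented to perform.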



   .. py:method:: _fetch_single_cutout(row, idx=None, manifest_idx=None)

      Fetch cutout, using saved cutout if available, with optional band filtering.



   .. py:method:: _fetch_cutout_with_cache(row)

      Generate cutout using cached patch fetching with NaN filling for failed bands.
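
      NaN filling keeps the band axis the same length even when some bands fail.
      The sketch below uses nested lists instead of tensors, and the function name
      and inputs are assumptions for illustration:

      ```python
      import math


      def assemble_cutout(band_images, bands, shape):
          """Stack one plane per band, substituting NaN planes for missing bands."""
          height, width = shape
          planes, downloaded = [], []
          for band in bands:
              image = band_images.get(band)
              if image is None:
                  # Failed band: keep its slot so band order stays consistent.
                  planes.append([[math.nan] * width for _ in range(height)])
              else:
                  planes.append(image)
                  downloaded.append(band)
          return planes, downloaded


      images = {"g": [[1.0, 2.0], [3.0, 4.0]], "i": [[5.0, 6.0], [7.0, 8.0]]}
      planes, downloaded = assemble_cutout(images, ("g", "r", "i"), (2, 2))
      print(len(planes), downloaded)
      ```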



   .. py:method:: _load_tensor_for_cache(object_id: str)

      Implementation of TensorCacheMixin abstract method.



   .. py:method:: __len__()

      Return length of current catalog, not the full manifest.



   .. py:method:: _get_manifest_index_for_catalog_index(catalog_idx)

      Map catalog index to manifest index. Returns None if the item is not in the manifest.



   .. py:method:: get_image(idxs)

      Fetch image cutout(s) for given index or indices, using caching and band filtering.

      Parameters:
      -----------
      idxs: int or slice or list
          Index or indices to fetch.

      Returns:
      --------
      torch.Tensor or list of torch.Tensor:
          Single cutout tensor or list of cutout tensors.



   .. py:method:: __getitem__(idxs) -> dict

      Modified to pass index for saving cutouts.

      Parameters:
      -----------
      idxs: int or slice or list
          Index or indices to fetch.

      Returns:
      --------
      dict:
          Dictionary with key 'data' containing another dict of default data fields
          to return. Currently only 'image' is supported.



   .. py:method:: download_cutouts(indices=None, sync_filesystem=True, max_workers=None, force_retry=False)

      Download cutouts using multiple threads with caching.

      :param indices: List of indices to download, or None for all
      :param sync_filesystem: Whether to sync manifest with existing files on disk
      :param max_workers: Maximum number of worker threads, or None to use default
      :param force_retry: Whether to retry previously failed downloads
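
      The resume and retry behavior can be sketched with a thread pool. Everything
      here is illustrative: the manifest is a list of dictionaries, an empty
      filename is assumed to mean "pending", and ``"Attempted"`` marks a failed
      download as in ``_update_manifest_entry``:

      ```python
      from concurrent.futures import ThreadPoolExecutor, as_completed


      def download_all(manifest, fetch, max_workers=1, force_retry=False):
          def needs_download(entry):
              if entry["filename"] == "":
                  return True
              return force_retry and entry["filename"] == "Attempted"

          pending = [i for i, entry in enumerate(manifest) if needs_download(entry)]
          with ThreadPoolExecutor(max_workers=max_workers) as pool:
              futures = {pool.submit(fetch, i): i for i in pending}
              for future in as_completed(futures):
                  idx = futures[future]
                  try:
                      manifest[idx]["filename"] = future.result()
                  except Exception:
                      manifest[idx]["filename"] = "Attempted"
          return len(pending)


      fail_once = {2}  # index 2 fails on its first attempt only


      def fake_fetch(idx):
          if idx in fail_once:
              fail_once.discard(idx)
              raise RuntimeError("simulated download failure")
          return f"cutout_{idx:04d}.pt"


      manifest = [{"filename": ""} for _ in range(4)]
      first = download_all(manifest, fake_fetch, max_workers=2)
      second = download_all(manifest, fake_fetch)  # failed entry is skipped
      third = download_all(manifest, fake_fetch, force_retry=True)
      print(first, second, third)
      ```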



   .. py:method:: _download_single_cutout(catalog_idx, manifest_idx)

      Helper method to download a single cutout.



   .. py:method:: cache_info()

      Get cache statistics.



   .. py:method:: clear_cache()

      Clear the LRU cache.



   .. py:method:: manifest_stats()

      Get manifest statistics including downloaded bands information.



   .. py:method:: band_filtering_info()

      Get information about current band filtering configuration.



   .. py:method:: save_manifest_now()

      Force immediate manifest save.



   .. py:method:: _determine_numprocs_download()
      :staticmethod:


      Determine number of threads for downloading.



   .. py:method:: reset_failed_downloads()

      Reset failed download attempts to allow retry.



   .. py:method:: download_progress()

      Get detailed download progress information.
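
      As a rough sketch of the metrics described for ``manifest_stats()`` and
      ``download_progress()``, the counts can be derived from the filename column.
      The key names, the pending marker, and the exact definitions of the two rates
      are assumptions, not the method's actual return value:

      ```python
      def progress(manifest):
          total = len(manifest)
          failed = sum(1 for entry in manifest if entry["filename"] == "Attempted")
          pending = sum(1 for entry in manifest if entry["filename"] == "")
          successful = total - failed - pending
          attempted = successful + failed
          return {
              "total": total,
              "successful": successful,
              "failed": failed,
              "pending": pending,
              "completion_percent": 100.0 * successful / total if total else 0.0,
              "failure_rate": failed / attempted if attempted else 0.0,
          }


      stats = progress(
          [
              {"filename": "cutout_0000.pt"},
              {"filename": "cutout_0001.pt"},
              {"filename": "Attempted"},
              {"filename": ""},
          ]
      )
      print(stats)
      ```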



   .. py:method:: download_summary()

      Get detailed download and band analysis, accounting for band filtering.



