hyrax.data_sets.fits_image_dataset

FitsImageDataSet is intended for image data stored in a single directory and described by a tabular catalog file.

At minimum, your tabular catalog must contain the following:

A unique ID column for each astronomical object you are interested in.
A filename column containing the filename of the fits image file.
If you have multiple images with the same object ID, each image must have its own row in the catalog, and there must be a column giving the telescope filter that differentiates these images.
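
As a rough illustration of the catalog layout described above, the sketch below writes a small catalog with one row per image, using the default column names object_id, filter, and filename. The object IDs, filters, and file names are hypothetical, and CSV is used only to keep the sketch dependency-free; it assumes CSV is among the table formats your astropy installation can read.

```python
import csv
import io

# Hypothetical objects: two objects, each imaged in two filters.
rows = [
    {"object_id": "1001", "filter": "g", "filename": "1001_g.fits"},
    {"object_id": "1001", "filter": "r", "filename": "1001_r.fits"},
    {"object_id": "1002", "filter": "g", "filename": "1002_g.fits"},
    {"object_id": "1002", "filter": "r", "filename": "1002_r.fits"},
]


def write_catalog(rows, handle):
    """Write one catalog row per image to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=["object_id", "filter", "filename"])
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer; pass an open file handle instead to save it.
buffer = io.StringIO()
write_catalog(rows, buffer)
catalog_text = buffer.getvalue()
print(catalog_text)
```

Note that the same object ID appears on two rows, and the filter column is what tells the two images apart.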

We recommend all your fits images be roughly the same size.

Setting up hyrax to use FitsImageDataSet works as follows in a notebook. The same configuration options can go in a configuration file if you are running from the CLI.

import hyrax
h = hyrax.Hyrax()
h.config["data_set"]["name"] = "FitsImageDataSet"
h.config["general"]["data_dir"] = "/file/path/to/where/your/fits/files/are"

# Location of your catalog file. Any file format supported by astropy.table.Table will work
h.config["data_set"]["filter_catalog"] = "/file/path/to/your/catalog.fits"

# Size in pixels to send to ML model. All images must be this size or larger on
# both dimensions
h.config["data_set"]["crop_to"] = (100,100)

# This simply attempts to construct the dataset. Once that works, you might try
# to train or infer
dataset = h.prepare()

This is the minimal setup that can work; however, there are several other configuration options you may need to set depending on your usage.

The column names for the required columns are configurable. By default we use object_id, filter, and filename; however, by setting h.config["data_set"]["object_id_column_name"] you can set the correct name for your catalog file. h.config["data_set"]["filter_column_name"] and h.config["data_set"]["filename_column_name"] work in a corresponding manner.
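
The override behavior described above can be pictured with a plain dictionary standing in for h.config["data_set"]. This is only a sketch of the defaulting logic, not hyrax's implementation; resolve_column_names and the catalog column name "ID" are hypothetical.

```python
# Default column names, as stated in the text above.
DEFAULT_COLUMNS = {
    "object_id_column_name": "object_id",
    "filter_column_name": "filter",
    "filename_column_name": "filename",
}


def resolve_column_names(data_set_config):
    """Return the effective column names, preferring user-supplied values."""
    return {key: data_set_config.get(key, default) for key, default in DEFAULT_COLUMNS.items()}


# Stand-in for h.config["data_set"] where only the object ID column is renamed.
overrides = {"object_id_column_name": "ID"}
columns = resolve_column_names(overrides)
print(columns)
```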

If your dataset does not fit in memory on your system, we recommend setting h.config["data_set"]["use_cache"] and h.config["data_set"]["preload_cache"] to False. Both are True by default. The former caches all tensors read during an epoch into system RAM, with the intent of speeding up later epochs of training if your disk has low bandwidth. The latter begins this process of caching all tensors into system RAM in a background thread as soon as the FitsImageDataSet is constructed, front-running the train or infer verb requesting tensors. The intent of this optimization is to speed up the first epoch of training in the case where your disk has high latency. Both will result in crashes if there is not enough room in your system RAM for the entire dataset.
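
To judge whether the cache is likely to fit in RAM, a back-of-the-envelope estimate can help. The sketch below assumes each cached pixel occupies 4 bytes (a 32-bit float) and ignores any per-tensor overhead; both are assumptions, and estimated_cache_bytes is a hypothetical helper, not part of hyrax.

```python
def estimated_cache_bytes(num_objects, num_filters, crop_to, bytes_per_pixel=4):
    """Estimate the RAM needed to cache every cropped tensor in the dataset."""
    pixels_per_image = crop_to[0] * crop_to[1]
    return num_objects * num_filters * pixels_per_image * bytes_per_pixel


# Hypothetical dataset: 500,000 objects, 5 filters, 100x100 pixel crops.
estimate = estimated_cache_bytes(500_000, 5, (100, 100))
print(f"{estimate / 1e9:.1f} GB")  # prints "100.0 GB"
```

If the estimate approaches or exceeds your system RAM, that suggests disabling use_cache and preload_cache as described above.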

If you need to truncate your dataset to fit in RAM, the easiest way is to select a small number of rows from your original catalog file. FitsImageDataSet will only attempt to load images that exist in the catalog.
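
One way to truncate a catalog without splitting an object's filters across the cut is to keep every row belonging to the first few object IDs. The sketch below works on rows held as dictionaries; truncate_catalog and the sample rows are hypothetical, and the default column name object_id is assumed.

```python
def truncate_catalog(rows, max_objects, id_column="object_id"):
    """Keep all rows for the first max_objects distinct object IDs, in original order."""
    kept_ids = set()
    result = []
    for row in rows:
        object_id = row[id_column]
        # Admit a new object only while there is room for more objects.
        if object_id not in kept_ids and len(kept_ids) < max_objects:
            kept_ids.add(object_id)
        if object_id in kept_ids:
            result.append(row)
    return result


# Hypothetical catalog rows, with filters of different objects interleaved.
rows = [
    {"object_id": "1001", "filter": "g", "filename": "1001_g.fits"},
    {"object_id": "1002", "filter": "g", "filename": "1002_g.fits"},
    {"object_id": "1001", "filter": "r", "filename": "1001_r.fits"},
    {"object_id": "1003", "filter": "g", "filename": "1003_g.fits"},
    {"object_id": "1002", "filter": "r", "filename": "1002_r.fits"},
]
small = truncate_catalog(rows, max_objects=2)
print([row["filename"] for row in small])
```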

Attributes

`logger`
`files_dict`

Classes

FitsImageDataSet

Dataset for Fits Images, typically cutouts.

Module Contents

logger[source]

files_dict[source]

class FitsImageDataSet(config: dict, data_location=None)[source]

Bases: hyrax.data_sets.data_set_registry.HyraxDataset, hyrax.data_sets.data_set_registry.HyraxImageDataset, torch.utils.data.Dataset

Dataset for Fits Images, typically cutouts.

__init__()[source]

Initialize a FitsImageDataSet

Most work is done in _init_from_path and functions it calls in order to allow subclasses to override behavior.

Parameters:

config (dict) – Nested configuration dictionary for hyrax
data_location (Optional[Union[Path, str]]) – The directory location of the data that this dataset class will access

_called_from_test = False[source]

_config[source]

use_cache[source]

object_id_column_name[source]

filter_column_name[source]

filename_column_name[source]

_init_from_path(path: pathlib.Path | str)[source]

__init__ helper. Initialize an HSC data set from a path. This involves several filesystem scan operations and will ultimately open and read the header info of every fits file in the given directory.

Parameters: path (Union[Path, str]) – Path or string specifying the directory path that is the root of all filenames in the catalog table

_set_crop_transform()[source]

Returns the crop transform on the image

If overridden, the subclass must: 1) set self.cutout_shape to a tuple of ints representing the size of the cutouts that will be returned, at some point in the init flow; and 2) update the crop transform using self.set_crop_transform() from the HyraxImageDataset mixin.

_read_filter_catalog(filter_catalog_path: pathlib.Path | None)[source]

_parse_filter_catalog(table) → None[source]

Sets self.files by parsing the catalog.

Subclasses may override this function to control parsing of the table more directly, but the overriding class must create the files dict which has type dict[object_id -> dict[filter -> filename]] with object_id, filter, and filename all strings. In the case of no filter distinction, a single flag value may be used for the filter dict keys in the inner dicts.

Parameters: table (Table) – The catalog we read in
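
The nested files dict described above, dict[object_id -> dict[filter -> filename]], can be built from catalog rows roughly as follows. This is a sketch under the assumption that rows are available as dictionaries keyed by the default column names; build_files_dict is a hypothetical helper, not the hyrax implementation.

```python
def build_files_dict(rows, id_col="object_id", filter_col="filter", filename_col="filename"):
    """Group catalog rows into {object_id: {filter: filename}} with string keys and values."""
    files = {}
    for row in rows:
        object_id = str(row[id_col])
        files.setdefault(object_id, {})[str(row[filter_col])] = str(row[filename_col])
    return files


# Hypothetical catalog rows; the integer object IDs are coerced to strings.
rows = [
    {"object_id": 1001, "filter": "g", "filename": "1001_g.fits"},
    {"object_id": 1001, "filter": "r", "filename": "1001_r.fits"},
    {"object_id": 1002, "filter": "g", "filename": "1002_g.fits"},
]
files = build_files_dict(rows)
print(files)
```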

_before_preload() → None[source]

_prepare_metadata()[source]

shape() → tuple[int, int, int][source]

Shape of the individual cutouts this will give to a model

Returns: Tuple describing the dimensions of the 3-dimensional tensor handed back to models. The first index is the number of filters, the second index is the width of each image, and the third index is the height of each image.
Return type: tuple[int,int,int]

__len__() → int[source]

Returns number of objects in this loader

Returns: number of objects in this data loader
Return type: int

get_object_id(idx: int) → str[source]

Get the object ID at the given index

Parameters: idx (int) – Index of the object ID to return
Returns: The object ID at the given index
Return type: str

get_image(idx: int)[source]

Get the image at the given index as a PyTorch Tensor.

Parameters: idx (int) – Index of the image to return
Returns: The image at the given index as a PyTorch Tensor.
Return type: torch.Tensor

__getitem__(idx: int)[source]

__contains__(object_id: str) → bool[source]

Allows `object_id in dataset` queries. Used by testing code.

Parameters: object_id (str) – The object ID to look for in the dataset
Returns: True if the given object_id is in the data set
Return type: bool

_get_file(index: int) → pathlib.Path[source]

Private indexing method across all files.

Returns the file path corresponding to the given index.

The index is zero-based and follows the same total order as the _all_files() and _object_files() iterators. Useful if you have an np.array() or list built from _all_files() and you need to select an individual item.

Only valid after self.object_ids, self.files, self.path, and self.num_filters have been initialized in __init__.

Parameters: index (int) – Index, see above for order semantics
Returns: The path to the file
Return type: Path
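
One plausible way to realize such an index is to treat it as a flattened (object, filter) pair. The sketch below assumes every object has exactly num_filters files and that filters are ordered by sorted name; neither assumption is confirmed by this page, and get_file is a hypothetical stand-in rather than the actual method.

```python
from pathlib import Path


def get_file(index, object_ids, files, root, num_filters):
    """Map a flat zero-based index to a file path, objects first and then filters."""
    object_index, filter_index = divmod(index, num_filters)
    object_id = object_ids[object_index]
    # Assumed ordering: filters sorted by name within each object.
    filter_names = sorted(files[object_id])
    return Path(root) / files[object_id][filter_names[filter_index]]


# Hypothetical dataset state: two objects with two filters each.
object_ids = ["1001", "1002"]
files = {
    "1001": {"g": "1001_g.fits", "r": "1001_r.fits"},
    "1002": {"g": "1002_g.fits", "r": "1002_r.fits"},
}
path = get_file(3, object_ids, files, "/data", num_filters=2)
print(path)
```

Here index 3 splits into object index 1 and filter index 1, so it selects the r-band file of the second object.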

ids(log_every=None) → collections.abc.Generator[str][source]

Public read-only iterator over all object_ids that enforces a strict total order across objects. Will not work prior to self.files initialization in __init__.

Yields: Iterator[str] – Object IDs currently in the dataset

_all_files()[source]

Private read-only iterator over all files that enforces a strict total order across objects and filters. Will not work prior to self.files and self.path initialization in __init__.

Yields: Path – The path to the file.

_filter_filename(object_id)[source]

Private read-only iterator over all files for a given object. This enforces a strict total order across filters. Will not work prior to self.files initialization in __init__.

Yields: filter_name, file name – The name of a filter and the file name for the fits file. The file name is relative to self.path.

_object_files(object_id)[source]

Private read-only iterator over all files for a given object. This enforces a strict total order across filters. Will not work prior to self.files and self.path initialization in __init__.

Yields: Path – The path to the file.

_file_to_path(filename: str) → pathlib.Path[source]

Turns a filename into a full path suitable for open. Equivalent to:

Path(self.path) / Path(filename)

Parameters: filename (str) – The filename string
Returns: A full path that is openable.
Return type: Path

static _determine_numprocs_preload()[source]

_preload_tensor_cache()[source]: When preloading the tensor cache is configured, this is called on a separate thread by __init__() to perform a preload of every tensor in the dataset.

_lazy_map_executor(executor: concurrent.futures.Executor, ids: collections.abc.Iterable[str])[source]

This is a version of concurrent.futures.Executor map() which lazily evaluates the iterator passed. We do this because we do not want all of the tensors to remain in memory during pre-loading. We would prefer a smaller set of in-flight tensors.

The total number of in-progress jobs is set at FitsImageDataSet._determine_numprocs_preload().

The total number of tensors held in memory may be slightly greater than that, owing to out-of-order execution.

This approach was copied from: https://gist.github.com/CallumJHays/0841c5fdb7b2774d2a0b9b8233689761

Parameters:

executor (concurrent.futures.Executor) – An executor for running our futures
work_fn (Callable[[str], torch.Tensor]) – The function that makes tensors out of object_ids
ids (Iterable[str]) – An iterable list of object IDs.

Yields:

Iterator[torch.Tensor] – An iterator over torch tensors, lazily loaded by running the work_fn as needed.
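
The lazy-map idea described above can be sketched with the standard library alone. This is a simplified illustration of bounding the number of in-flight jobs, not the hyrax implementation; lazy_map, fake_load, and the max_in_flight parameter are hypothetical names, and results are yielded in completion order rather than input order.

```python
import itertools
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def lazy_map(executor, work_fn, ids, max_in_flight):
    """Yield work_fn(id) results while keeping at most max_in_flight jobs unfinished."""
    id_iter = iter(ids)
    # Prime the executor with only the first batch of jobs.
    pending = {executor.submit(work_fn, i) for i in itertools.islice(id_iter, max_in_flight)}
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for future in done:
            # Submit one replacement job per finished job, if any ids remain.
            for next_id in itertools.islice(id_iter, 1):
                pending.add(executor.submit(work_fn, next_id))
            yield future.result()


def fake_load(object_id):
    # Hypothetical stand-in for reading fits files into a tensor.
    return f"tensor-for-{object_id}"


with ThreadPoolExecutor(max_workers=2) as pool:
    results = sorted(lazy_map(pool, fake_load, ["a", "b", "c", "d", "e"], max_in_flight=2))
print(results)
```

Because finished results may be waiting to be yielded while replacements are already running, slightly more results than max_in_flight can exist at once, which matches the caveat in the description above.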

_log_duration_tensorboard(name: str, start_time: int)[source]

Log a duration to tensorboardX. No-op if no tensorboard logger is configured.

The time logged is a floating point number of seconds derived from integer monotonic nanosecond measurements. time.monotonic_ns() is used for the current time.

The step number for the scalar series is an integer number of microseconds.

Parameters:

name (str) – The name of the scalar to log to tensorboard
start_time (int) – integer number of nanoseconds. Should be from time.monotonic_ns() when the duration started
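
The unit conversions described above can be sketched as follows. The page does not say which quantity the microsecond step is derived from; this sketch assumes it is the current monotonic time, which is an assumption. duration_and_step is a hypothetical helper that only computes the values and does not log anything.

```python
import time


def duration_and_step(start_time_ns, now_ns=None):
    """Return (duration in float seconds, step in integer microseconds)."""
    if now_ns is None:
        now_ns = time.monotonic_ns()
    duration_s = (now_ns - start_time_ns) / 1e9
    # Assumption: the step is the current monotonic time, in microseconds.
    step_us = now_ns // 1_000
    return duration_s, step_us


start = time.monotonic_ns()
elapsed_s, step = duration_and_step(start)
print(elapsed_s, step)
```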

_check_object_id_to_tensor_cache(object_id: str)[source]

_populate_object_id_to_tensor_cache(object_id: str)[source]

_read_object_id(object_id: str)[source]

_convert_to_torch(data: list[numpy.typing.ArrayLike])[source]

_object_id_to_tensor(object_id: str)[source]

Converts an object_id to a pytorch tensor with dimensions (self.num_filters, self.cutout_shape[0], self.cutout_shape[1]). This is done by reading the file and slicing away any excess pixels at the far corners of the image from (0,0).

The current implementation reads the files once the first time they are accessed, and then keeps them in a dict for future accesses.

Parameters: object_id (str) – The object_id requested
Returns: A tensor with dimension (self.num_filters, self.cutout_shape[0], self.cutout_shape[1])
Return type: torch.Tensor
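
The cropping step, keeping the pixels nearest (0,0) and discarding the excess at the far edges, can be illustrated with nested lists standing in for image arrays. This is a dependency-free sketch, not the hyrax code; crop_from_origin is a hypothetical helper, and with NumPy arrays the equivalent slice would be image[:rows, :cols].

```python
def crop_from_origin(image, cutout_shape):
    """Keep the cutout_shape region anchored at (0, 0), dropping the far rows and columns."""
    rows, cols = cutout_shape
    if len(image) < rows or any(len(row) < cols for row in image[:rows]):
        raise ValueError("image is smaller than the requested cutout shape")
    return [row[:cols] for row in image[:rows]]


# Hypothetical 3x4 single-filter image cropped to 2x3.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
cutout = crop_from_origin(image, (2, 3))
print(cutout)  # prints [[0, 1, 2], [4, 5, 6]]
```

Applying the same crop to each filter's image and stacking the results gives the (num_filters, cutout_shape[0], cutout_shape[1]) layout described above.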