hyrax.data_sets.inference_dataset

Attributes

logger

ORIGINAL_DATASET_CONFIG_FILENAME

Classes

InferenceDataSet

This is a dataset class to represent the situations where we wish to treat the output of inference as a dataset.

InferenceDataSetWriter

Class to write out inference datasets. Used by infer, umap to consistently write out numpy files in batches which can be read by InferenceDataSet.

Module Contents

logger
ORIGINAL_DATASET_CONFIG_FILENAME = 'original_dataset_config.toml'
class InferenceDataSet(config, results_dir: pathlib.Path | str | None = None, verb: str | None = None)

Bases: hyrax.data_sets.data_set_registry.HyraxDataset, torch.utils.data.Dataset

This is a dataset class to represent the situations where we wish to treat the output of inference as a dataset, e.g. when performing umap/visualization operations.

Initialize an InferenceDataSet object.

As a user of this code, you should almost never create this class yourself. Instances of this class are returned by the umap and infer verbs; prefer those over creating your own.

If you do end up creating your own instance, you will need a hyrax config and some knowledge of where the results you are interested in are stored.

Parameters:
  • config (dict) – The hyrax config dictionary

  • results_dir (Optional[Union[Path, str]], optional) –

    The results subdirectory of the inference or umap results you want to access, by default None. If no results subdirectory is provided, this function will attempt the following in order:

    1. Use the directory specified in config['results']['inference_dir'] if set and the directory exists

    2. Look in the results directory configured in config['general']['results_dir'] (./results/ by default), then use the most recent results directory corresponding to the specified verb.

  • verb (Optional[str], optional) – The name of the verb that generated the results, only important when the most recent results are being fetched. If no verb is provided, “infer” will be assumed.

Raises:

RuntimeError – When the provided results directory is corrupt, or cannot be found.
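The lookup order described above can be sketched as follows. This is a minimal illustration under stated assumptions, not hyrax's implementation: the function name resolve_results_dir_sketch is hypothetical, the config is treated as a plain nested dict, and the run-directory naming pattern ("<sortable timestamp>-<verb>-<suffix>") is assumed purely so that "most recent" has a concrete meaning here.

```python
from pathlib import Path


def resolve_results_dir_sketch(config, results_dir=None, verb=None):
    """Illustrative only: mirrors the documented lookup order, not hyrax's code."""
    # An explicitly passed results_dir takes precedence over the config.
    if results_dir is not None:
        path = Path(results_dir)
        if not path.is_dir():
            raise RuntimeError(f"Results directory {path} not found")
        return path

    # 1. config['results']['inference_dir'], if set and the directory exists.
    configured = config.get("results", {}).get("inference_dir")
    if configured and Path(configured).is_dir():
        return Path(configured)

    # 2. Most recent run for the verb under config['general']['results_dir'].
    verb = verb or "infer"  # documented default when no verb is given
    root = Path(config.get("general", {}).get("results_dir", "./results/"))
    # ASSUMPTION: run directories are named "<sortable timestamp>-<verb>-<suffix>",
    # so sorting by name puts the most recent run last.
    candidates = sorted(p for p in root.glob(f"*-{verb}-*") if p.is_dir())
    if not candidates:
        raise RuntimeError(f"No {verb} results found under {root}")
    return candidates[-1]
```

With this sketch, calling resolve_results_dir_sketch(config) with no other arguments falls through to step 2 and looks for the newest "infer" run, which matches the documented default for verb.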

results_dir
batch_index
length
cached_batch_num: int | None = None
shape_element
_original_dataset_config
original_dataset
_shape()

The shape of the dataset (Discovered from files)

Returns:

Tuple with the shape of an individual element of the dataset

Return type:

Tuple

ids() -> collections.abc.Generator[str]

IDs of this dataset. Will return a string generator with IDs.

These IDs are the IDs of the dataset used originally to generate this dataset.

Returns:

Generator that yields the string ids of this dataset

Return type:

Generator[str]

Yields:

Generator[str] – Yields the string ids of this dataset

__getitem__(idx: int | numpy.ndarray)

Implements the [] operator

Parameters:

idx (Union[int, np.ndarray]) – Either an index or a numpy array of indexes. These are NOT the ID values of the dataset, but rather a zero-based index starting at the beginning of the inference dataset.

Returns:

Either the tensor corresponding to a single result, or a tensor with a multiplicity of results if multiple indexes were passed.

Return type:

torch.Tensor
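The difference between a positional index and an ID value can be illustrated with a toy stand-in. ToyBatchedResults below is hypothetical and is not the hyrax class: it keeps its batches in memory as numpy arrays rather than reading files, and returns numpy arrays rather than tensors. It only shows that [] takes zero-based positions, and that an array of positions returns one row per position in the order requested.

```python
import numpy as np


class ToyBatchedResults:
    """Toy stand-in, NOT hyrax's InferenceDataSet: batched results looked up by position."""

    def __init__(self, ids, batches):
        self.batches = batches  # list of arrays, one array per batch
        self._ids = np.asarray(ids)
        # Position i in the dataset lives at (batch_num[i], offset[i]).
        self.batch_num = np.concatenate([np.full(len(b), n) for n, b in enumerate(batches)])
        self.offset = np.concatenate([np.arange(len(b)) for b in batches])

    def __len__(self):
        return len(self.batch_num)

    def __getitem__(self, idx):
        idxs = np.atleast_1d(idx)
        rows = [self.batches[self.batch_num[i]][self.offset[i]] for i in idxs]
        out = np.stack(rows)
        # A bare int returns a single result; an array returns one row per index.
        return out[0] if np.isscalar(idx) else out

    def ids(self):
        # IDs come from the original dataset and are yielded as strings.
        yield from (str(i) for i in self._ids)


ds = ToyBatchedResults(
    ids=["obj_9", "obj_4", "obj_7"],
    batches=[np.array([[0.0, 1.0], [2.0, 3.0]]), np.array([[4.0, 5.0]])],
)
single = ds[2]                   # position 2, i.e. the result stored for "obj_7"
several = ds[np.array([2, 0])]   # rows come back in the order requested
```

Note that ds["obj_7"] would not work in this sketch: the ID strings are only available through ids(), and mapping an ID back to a position is left to the caller.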

__len__() -> int

Returns the length of the dataset.

Returns:

Length of the dataset.

Return type:

int

property original_config: dict

Get the original configuration for the dataset used to generate this inference dataset

Since this sort of dataset is definitionally an intermediate product, this returns the runtime config used to construct that dataset rather than this one.

Returns:

Configuration that can be used to create the original dataset that was used as input for whatever inference process created this dataset.

Return type:

ConfigDict

metadata_fields() -> list[str]

Get the metadata fields associated with the original dataset used to generate this one

Returns:

List of valid field names for metadata queries

Return type:

list[str]

metadata(idxs: numpy.typing.ArrayLike, fields: list[str]) -> numpy.typing.ArrayLike

Get metadata associated with the data in the InferenceDataSet. This metadata comes from the original dataset, but is indexed according to the InferenceDataSet.

Parameters:
  • idxs (npt.ArrayLike) – Indexes in the InferenceDataSet for which metadata is desired

  • fields (list[str]) – Metadata fields requested

Returns:

An array where the rows correspond to the passed list of indexes and the columns correspond to the fields passed. Order is preserved: metadata[i] corresponds to idxs[i].

Return type:

npt.ArrayLike
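The row and column ordering described above can be shown with a small numpy sketch. The field names, the values, and the helper metadata_sketch are all hypothetical, and the real method may return a differently structured array; the only point illustrated is that rows follow idxs and columns follow fields.

```python
import numpy as np

# Hypothetical metadata for the original dataset, one array per field.
original_metadata = {
    "ra": np.array([10.0, 20.0, 30.0, 40.0]),
    "dec": np.array([-1.0, -2.0, -3.0, -4.0]),
}


def metadata_sketch(idxs, fields):
    """Illustrative only: rows follow idxs, columns follow fields."""
    idxs = np.asarray(idxs)
    return np.column_stack([original_metadata[f][idxs] for f in fields])


table = metadata_sketch([3, 0], ["dec", "ra"])
# table[0] holds the values for index 3, table[1] those for index 0.
```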

_load_from_batch_file(batch_num: int, ids=Union[int, np.ndarray]) -> numpy.ndarray

Hands back an array of tensors given a set of IDs in a particular batch and the given batch number

_resolve_results_dir(config, results_dir: pathlib.Path | str | None, verb: str | None) -> pathlib.Path

Initialize an inference results directory as a data source. Accepts an override of what directory to use.

class InferenceDataSetWriter(original_dataset: torch.utils.data.Dataset, result_dir: str | pathlib.Path)

Class to write out inference datasets. Used by infer, umap to consistently write out numpy files in batches which can be read by InferenceDataSet.

With the exception of building ID->Batch indexing info, this is implemented as a bag-o-functions that manipulate the filesystem directly as their primary effect.

__init__()
result_dir
batch_index = 0
id_dtype = None
all_ids
all_batch_nums
writer_pool
original_dataset_config
write_batch(ids: numpy.ndarray, tensors: list[numpy.ndarray])

Write a batch of tensors into the dataset. This writes the whole batch immediately. The caller is responsible for keeping batch sizes consistent and for ensuring that ids is the same length as tensors.

Parameters:
  • ids (np.ndarray) – Array of IDs. The dtype of the elements must match the dtype of the IDs of the original dataset used to construct this InferenceDataSetWriter.

  • tensors (list[np.ndarray]) – List of consistently dimensioned numpy arrays to save.
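The pattern of writing one file per batch while accumulating ID-to-batch bookkeeping can be sketched as below. ToyBatchWriter is a hypothetical stand-in, not InferenceDataSetWriter: the file names (batch_N.npy, batch_index.npy) and the index layout are assumptions made for the example, and the writes here are synchronous, whereas the real class exposes a writer_pool attribute.

```python
import tempfile
from pathlib import Path

import numpy as np


class ToyBatchWriter:
    """Toy stand-in, NOT hyrax's InferenceDataSetWriter."""

    def __init__(self, result_dir):
        self.result_dir = Path(result_dir)
        self.batch_index = 0
        self.all_ids = []
        self.all_batch_nums = []

    def write_batch(self, ids, tensors):
        # The caller is expected to pass one ID per tensor; checked here for the demo.
        if len(ids) != len(tensors):
            raise ValueError("ids and tensors must have the same length")
        # HYPOTHETICAL file name; the real on-disk layout is not documented here.
        np.save(self.result_dir / f"batch_{self.batch_index}.npy", np.stack(tensors))
        self.all_ids.extend(ids.tolist())
        self.all_batch_nums.extend([self.batch_index] * len(ids))
        self.batch_index += 1

    def write_index(self):
        # Record which batch holds each ID so a reader can locate it later.
        index = np.array(
            list(zip(self.all_ids, self.all_batch_nums)),
            dtype=[("id", "U32"), ("batch_num", "i8")],
        )
        np.save(self.result_dir / "batch_index.npy", index)


with tempfile.TemporaryDirectory() as tmp:
    writer = ToyBatchWriter(tmp)
    writer.write_batch(np.array(["a", "b"]), [np.zeros(3), np.ones(3)])
    writer.write_batch(np.array(["c"]), [np.full(3, 2.0)])
    writer.write_index()  # call once, after the last batch
    index = np.load(Path(tmp) / "batch_index.npy")
    second_batch = np.load(Path(tmp) / "batch_1.npy")
```

In this sketch the index is only complete once write_index has been called after the final write_batch, which is why the two calls are kept separate.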

write_index()

Writes out the batch index built up by this object over multiple write_batch calls. See _save_batch_index for details.

_save_batch_index()

Save a batch index in the result directory provided

_save_file(filename: str, data: numpy.ndarray)

Save a numpy array to a file in the result directory provided