hyrax.datasets.result_dataset

Lance-based storage for inference results.

This module provides ResultDataset and ResultDatasetWriter classes that store inference results in Lance columnar format instead of batched .npy files.

Attributes

logger

TABLE_NAME

LANCE_DB_DIR

Classes

ResultDatasetWriter

Writer for Lance-based inference results.

ResultDataset

Reader for Lance-based inference results.

Module Contents

logger
TABLE_NAME = 'results'
LANCE_DB_DIR = 'lance_db'
class ResultDatasetWriter(result_dir: str | pathlib.Path)

Writer for Lance-based inference results.

Writes inference results incrementally to Lance format using table.add() for each batch, avoiding memory accumulation.

Initialize the writer.

Parameters:

result_dir (Union[str, Path]) – Directory where the Lance database will be created

result_dir
lance_dir
db = None
table = None
schema = None
tensor_dtype = None
tensor_shape = None
batch_count = 0
write_batch(object_ids: numpy.ndarray, data: list[numpy.ndarray])

Write a batch of results incrementally.

Parameters:
  • object_ids (np.ndarray) – Array of object IDs (will be converted to strings)

  • data (list[np.ndarray]) – List of numpy arrays (tensors) to write

commit()

Finalize the write by optimizing the table.
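The write-then-commit flow described above can be sketched as follows. This is a minimal illustration, not the Hyrax implementation: `RecordingWriter` is a hypothetical stand-in that only mimics the `write_batch()` and `commit()` interface, and `write_in_batches` and `batch_size` are names invented for this example.

```python
import numpy as np


class RecordingWriter:
    """Hypothetical stand-in exposing the same two methods as ResultDatasetWriter."""

    def __init__(self):
        self.batches = []
        self.committed = False

    def write_batch(self, object_ids, data):
        # The documented writer converts object IDs to strings; mimic that here.
        self.batches.append(([str(i) for i in object_ids], list(data)))

    def commit(self):
        self.committed = True


def write_in_batches(writer, object_ids, tensors, batch_size):
    """Feed results to the writer one batch at a time, then commit once."""
    for start in range(0, len(object_ids), batch_size):
        stop = start + batch_size
        writer.write_batch(object_ids[start:stop], list(tensors[start:stop]))
    writer.commit()


object_ids = np.arange(5)
tensors = np.random.default_rng(0).random((5, 4)).astype(np.float32)
writer = RecordingWriter()
write_in_batches(writer, object_ids, tensors, batch_size=2)
```

Only one batch is held by the caller at a time, and `commit()` is called once after the last batch, matching the "finalize" role described for it.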

_create_schema(sample_tensor: numpy.ndarray)

Create a PyArrow schema with tensor metadata.

Parameters:

sample_tensor (np.ndarray) – Sample tensor to determine dtype and shape
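One way to carry tensor metadata alongside stored values is to record the dtype and shape as strings and store each tensor flattened. The sketch below shows that idea with NumPy and JSON only. The key names `tensor_dtype` and `tensor_shape` mirror the writer attributes above, but the encoding and the helper names `describe_tensor` and `restore_tensor` are assumptions; the actual PyArrow schema built by `_create_schema` may differ.

```python
import json

import numpy as np


def describe_tensor(sample_tensor):
    """Return string metadata describing a sample tensor (assumed encoding)."""
    return {
        "tensor_dtype": str(sample_tensor.dtype),
        "tensor_shape": json.dumps(list(sample_tensor.shape)),
    }


def restore_tensor(flat_values, metadata):
    """Rebuild a tensor from flattened values using the recorded metadata."""
    dtype = np.dtype(metadata["tensor_dtype"])
    shape = tuple(json.loads(metadata["tensor_shape"]))
    return np.asarray(flat_values, dtype=dtype).reshape(shape)


sample = np.arange(6, dtype=np.float32).reshape(2, 3)
metadata = describe_tensor(sample)
restored = restore_tensor(sample.ravel().tolist(), metadata)
```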

class ResultDataset(config: dict, data_location: pathlib.Path | str)

Bases: hyrax.datasets.dataset_registry.HyraxDataset

Reader for Lance-based inference results.

Provides HyraxQL-compatible getters for results stored in Lance format.

Initialize the dataset.

Parameters:
  • config (dict) – Hyrax configuration dictionary

  • data_location (Union[Path, str]) – Path to results directory containing lance_db/
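Because `data_location` is expected to contain a `lance_db/` subdirectory, a caller may want to check for it before constructing the reader. The helper `has_lance_results` below is a hypothetical name for this example and is not part of Hyrax. It only checks that the directory exists, not that it holds a valid Lance table.

```python
import tempfile
from pathlib import Path

LANCE_DB_DIR = "lance_db"  # same value as the module-level constant


def has_lance_results(data_location):
    """Return True if data_location contains a lance_db/ subdirectory."""
    return (Path(data_location) / LANCE_DB_DIR).is_dir()


with tempfile.TemporaryDirectory() as tmp:
    empty_result = has_lance_results(tmp)
    (Path(tmp) / LANCE_DB_DIR).mkdir()
    populated_result = has_lance_results(tmp)
```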

data_location
lance_dir
db
table
lance_dataset
tensor_shape
tensor_dtype
__len__() → int

Return the number of records in the dataset.

__getitem__(idx: int | numpy.ndarray)

Get data by index.

Parameters:

idx (Union[int, np.ndarray]) – Single index or array of indices

Returns:

Data tensor(s)

Return type:

np.ndarray

Raises:

IndexError – If index is out of range
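The indexing contract above (a single int or an array of indices, with `IndexError` for out-of-range values) can be illustrated with a small sketch. An in-memory NumPy array stands in for the Lance table, and `check_indices` is a hypothetical helper. Whether the real method accepts negative indices is not stated here; this sketch assumes they are rejected.

```python
import numpy as np


def check_indices(idx, length):
    """Validate an int or array of indices against a dataset length."""
    indices = np.atleast_1d(np.asarray(idx))
    # Assumption: anything outside [0, length) counts as out of range.
    if indices.size and (indices.min() < 0 or indices.max() >= length):
        raise IndexError(f"index out of range for dataset of length {length}")
    return indices


stored = np.arange(12, dtype=np.float32).reshape(4, 3)  # 4 records of shape (3,)

single = stored[check_indices(2, len(stored))][0]
several = stored[check_indices(np.array([0, 3]), len(stored))]

try:
    check_indices(4, len(stored))
    out_of_range_raised = False
except IndexError:
    out_of_range_raised = True
```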

__get_all__()

Get all data tensors in the dataset.

This is a specialized method that is meant for internal use (e.g. visualize_v2). It retrieves all tensors efficiently by assuming column names and accessing the array buffer directly, without creating Python objects for each row.

Returns:

All data tensors

Return type:

np.ndarray
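The general idea of reading a whole column through its buffer, rather than row by row, can be shown with NumPy alone. This is a sketch under assumptions: a bytes object stands in for the column's contiguous value buffer, and the values for `tensor_shape`, `tensor_dtype`, and `num_rows` are made up for the example. The actual method works on the Lance table and may differ in detail.

```python
import numpy as np

tensor_shape = (2, 2)
tensor_dtype = np.dtype("float32")
num_rows = 3

# Stand-in for a column's contiguous value buffer: all rows laid end to end.
buffer = np.arange(12, dtype=tensor_dtype).tobytes()

# Interpret the whole buffer at once; no per-row Python objects are created.
all_tensors = np.frombuffer(buffer, dtype=tensor_dtype).reshape(num_rows, *tensor_shape)
```

The result of `np.frombuffer` is a read-only view over the bytes, so the reshape costs no copy.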

get_data(idx: int)

Get data tensor at index (HyraxQL getter).

Parameters:

idx (int) – Index of the data item

Returns:

Data tensor

Return type:

np.ndarray

get_object_id(idx: int) → str

Get object ID at index (HyraxQL getter).

Parameters:

idx (int) – Index of the data item

Returns:

Object ID

Return type:

str

ids() → list[str]

Generate all object IDs.

Returns:

Object IDs in order

Return type:

list[str]
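Since `ids()` returns object IDs in order, a caller can build a lookup from object ID to position. The sketch below assumes that this order matches the integer indices accepted by `get_data()` and `get_object_id()`. `build_id_index` is a hypothetical helper, and the example IDs are made up.

```python
def build_id_index(object_ids):
    """Map each object ID to its first position in the dataset order."""
    index = {}
    for position, object_id in enumerate(object_ids):
        index.setdefault(object_id, position)
    return index


# Stand-in for the list returned by ids(); real IDs come from the dataset.
example_ids = ["obj_a", "obj_b", "obj_c"]
id_to_index = build_id_index(example_ids)
```

With a real dataset, `id_to_index["obj_b"]` would then be the index to pass to `get_data()` for that object.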