hyrax.vector_dbs.chromadb_impl

Attributes

`MIN_SHARDS_FOR_PARALLELIZATION`
`logger`

Classes

`ChromaDB`	Implementation of the VectorDB interface using ChromaDB as the backend.

Functions

`_query_for_nn`(results_dir, shard_name, vectors, k)	The query function for the ProcessPoolExecutor to query a shard for the nearest neighbors of a set of vectors.
`_query_for_id`(results_dir, shard_name, id, include)	The query function for the ProcessPoolExecutor to query a shard for the vector associated with a given id.

Module Contents

MIN_SHARDS_FOR_PARALLELIZATION = 50

logger

_query_for_nn(results_dir: str, shard_name: str, vectors: list[numpy.ndarray], k: int)

The query function for the ProcessPoolExecutor to query a shard for the nearest neighbors of a set of vectors.

Parameters:

results_dir (str) – The directory where the ChromaDB results are stored
shard_name (str) – The name of the ChromaDB shard to load and query
vectors (list[np.ndarray]) – The vectors used as inputs for the nearest neighbor search
k (int) – The number of nearest neighbors to return

Returns:

The results of the nearest neighbor search for the given vectors in the given shard.

Return type:

dict
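When a database is split across shards, the per-shard nearest-neighbor results have to be combined into one global top-k list. The following is a minimal sketch of such a merge, not the module's actual implementation. It assumes each per-shard result is a dict with "ids" and "distances" entries holding one inner list per query vector; `merge_shard_results` is a hypothetical helper name.

```python
import heapq


def merge_shard_results(shard_results, k):
    """Merge per-shard nearest-neighbor results into a global top-k.

    Assumes each result looks like
    {"ids": [[...], ...], "distances": [[...], ...]},
    with one inner list per query vector. Hypothetical helper.
    """
    if not shard_results:
        return {}
    n_queries = len(shard_results[0]["ids"])
    merged = {}
    for q in range(n_queries):
        # Pool (distance, id) candidates for query q from every shard.
        candidates = [
            (dist, id_)
            for result in shard_results
            for id_, dist in zip(result["ids"][q], result["distances"][q])
        ]
        # Keep the ids of the k candidates with the smallest distances.
        merged[q] = [id_ for _, id_ in heapq.nsmallest(k, candidates)]
    return merged


shard_a = {"ids": [["a1", "a2"]], "distances": [[0.1, 0.7]]}
shard_b = {"ids": [["b1", "b2"]], "distances": [[0.3, 0.9]]}
print(merge_shard_results([shard_a, shard_b], k=3))  # {0: ['a1', 'b1', 'a2']}
```

The merged dict is keyed by the index of the query vector, matching the return shape documented for the search methods on this page.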

_query_for_id(results_dir: str, shard_name: str, id: str | list[str], include: list[str] | None)

The query function for the ProcessPoolExecutor to query a shard for the vector associated with a given id.

Parameters:

results_dir (str) – The directory where the ChromaDB results are stored
shard_name (str) – The name of the ChromaDB shard to load and query
id (Union[str, list[str]]) – One or more ids of vectors in the database shard we are trying to retrieve
include (list[str], optional) – The fields to include in the results.

Returns:

The results of the query for the given ids in the given shard.

Return type:

dict
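Both query functions are written to be run by a ProcessPoolExecutor, one call per shard. The sketch below shows one way such a fan-out could be gated by a shard-count threshold. It is an illustration under assumptions: `query_shards` and `fake_query` are hypothetical names, and the rule "serial below the threshold, pooled at or above it" is a guess at how MIN_SHARDS_FOR_PARALLELIZATION might be used, not confirmed behavior.

```python
from concurrent.futures import ProcessPoolExecutor

MIN_SHARDS_FOR_PARALLELIZATION = 50  # value documented for this module


def query_shards(query_fn, shard_names, *args, executor_cls=ProcessPoolExecutor):
    """Run query_fn(shard_name, *args) for every shard.

    Assumption: shards are queried serially below the threshold and in a
    worker pool at or above it. query_fn must be picklable (a module-level
    function) when a process pool is used.
    """
    if len(shard_names) < MIN_SHARDS_FOR_PARALLELIZATION:
        return [query_fn(name, *args) for name in shard_names]
    with executor_cls() as executor:
        futures = [executor.submit(query_fn, name, *args) for name in shard_names]
        # Collect in submission order so results line up with shard_names.
        return [future.result() for future in futures]


def fake_query(shard_name, k):
    """Stand-in for a per-shard query; returns a dict as the real ones do."""
    return {"shard": shard_name, "k": k}


if __name__ == "__main__":
    print(query_shards(fake_query, ["shard_0", "shard_1"], 5))
```

With only two shards the call takes the serial path, which avoids the cost of starting worker processes for a small database.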

class ChromaDB(config, context)

Bases: hyrax.vector_dbs.vector_db_interface.VectorDB

Implementation of the VectorDB interface using ChromaDB as the backend.

__init__()

Create a new instance of a VectorDB object.

Parameters:

config (dict, optional) – An instance of the runtime configuration, by default None
context (dict, optional) – An instance of the context object, by default None

chromadb_client = None

collection = None

shard_index = 0

shard_size = 0

shard_size_limit

vector_size_limit

min_shards_for_parallelization = 50

connect()

Create a database connection.

create()

Create a new database.

insert(ids: list[str | int], vectors: list[numpy.ndarray])

Insert a batch of vectors into the database.

Parameters:

ids (list[Union[str, int]]) – The ids to associate with the vectors
vectors (list[np.ndarray]) – The vectors to insert into the database
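The attributes shard_index, shard_size, and shard_size_limit suggest that inserts are spread across shards of bounded size. The toy class below sketches that bookkeeping. It is an assumption-based illustration that reuses the documented attribute names; `ShardTracker` is hypothetical and is not the ChromaDB class's real logic.

```python
class ShardTracker:
    """Toy bookkeeping for splitting inserts across shards (hypothetical)."""

    def __init__(self, shard_size_limit):
        self.shard_size_limit = shard_size_limit
        self.shard_index = 0   # which shard is currently being filled
        self.shard_size = 0    # number of vectors already in that shard
        self.shards = {0: []}  # shard index -> ids, standing in for collections

    def insert(self, ids):
        for id_ in ids:
            if self.shard_size >= self.shard_size_limit:
                # Current shard is full: start a new, empty one.
                self.shard_index += 1
                self.shard_size = 0
                self.shards[self.shard_index] = []
            # Ids are stored as strings, as ChromaDB expects.
            self.shards[self.shard_index].append(str(id_))
            self.shard_size += 1


tracker = ShardTracker(shard_size_limit=2)
tracker.insert([1, 2, 3, 4, 5])
print(tracker.shards)  # {0: ['1', '2'], 1: ['3', '4'], 2: ['5']}
```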

search_by_id(id: str | int, k: int = 1) → dict[int, list[str | int]]

Get the ids of the k nearest neighbors for a given id in the database.

Parameters:

id (Union[str, int]) – The id of the vector in the database whose k nearest neighbors we want to find. An int is converted to a string.
k (int, optional) – The number of nearest neighbors to return, by default 1. With k=1 only the closest neighbor is returned, which is almost always the input vector itself.

Returns:

Dictionary mapping the index of the input to the ids of its k nearest neighbors. Because this function accepts only one id, the only key is always 0, i.e. {0: [id1, id2, …]}

Return type:

dict[int, list[Union[str, int]]]

Raises:

ValueError – If more than one vector is found for the given id
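The documented contract (int ids converted to strings, a single key 0 in the result, ValueError on duplicate ids) can be illustrated with a brute-force stand-in. This is a sketch, not the class's implementation: `toy_search_by_id` is a hypothetical name, the "database" is a plain list of (id, vector) pairs, Euclidean distance is assumed, and returning an empty dict for a missing id is a placeholder choice because that case is not documented here.

```python
import math


def toy_search_by_id(records, id, k=1):
    """records is a list of (id, vector) pairs; hypothetical in-memory stand-in."""
    key = str(id)  # int ids are converted to strings, as documented
    matches = [vec for rec_id, vec in records if rec_id == key]
    if len(matches) > 1:
        raise ValueError(f"More than one vector found for id {key!r}")
    if not matches:
        return {}  # assumption: behavior for a missing id is not documented
    query = matches[0]
    # Rank every stored vector by its distance to the looked-up vector.
    ranked = sorted(records, key=lambda rec: math.dist(rec[1], query))
    return {0: [rec_id for rec_id, _ in ranked[:k]]}


records = [("1", [0.0, 0.0]), ("2", [1.0, 0.0]), ("3", [4.0, 4.0])]
print(toy_search_by_id(records, 1, k=2))  # {0: ['1', '2']}
```

The first neighbor is the queried id itself, which is why k=1 rarely returns anything new.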

search_by_vector(vectors: numpy.ndarray | list[numpy.ndarray], k: int = 1) → dict[int, list[str | int]]

Get the ids of the k nearest neighbors for a given vector.

Parameters:

vectors (Union[np.ndarray, list[np.ndarray]]) – The vector, or list of vectors, to use when searching for nearest neighbors
k (int, optional) – The number of nearest neighbors to return, by default 1, return only the closest neighbor

Returns:

Dictionary with input vector index as the key and the ids of the k nearest neighbors as the value.

Return type:

dict[int, list[Union[str, int]]]
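To make the return shape concrete when several vectors are passed, here is a small brute-force sketch. It assumes Euclidean distance and uses a plain dict as a hypothetical in-memory substitute for the database; `toy_search_by_vector` is an illustrative name, not part of hyrax.

```python
import math


def toy_search_by_vector(store, vectors, k=1):
    """Brute-force stand-in that mimics the documented return shape.

    store maps id -> vector (hypothetical substitute for the database).
    """
    # Accept a single flat vector as well as a list of vectors.
    if vectors and not isinstance(vectors[0], (list, tuple)):
        vectors = [vectors]
    results = {}
    for index, query in enumerate(vectors):
        # Sort stored ids by distance to this query vector.
        ranked = sorted(store, key=lambda id_: math.dist(store[id_], query))
        results[index] = ranked[:k]
    return results


store = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
print(toy_search_by_vector(store, [[0.9, 0.0], [4.0, 4.0]], k=2))
# {0: ['b', 'a'], 1: ['c', 'b']}
```

Each key is the position of a query vector in the input, so a single vector always yields the key 0.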

get_by_id(ids: list[str | int]) → dict[str | int, list[float]]

Retrieve the vectors associated with a list of ids.

Parameters:

ids (list[Union[str, int]]) – The ids of the vectors to retrieve. For ChromaDB instances, these should always be strings.

Returns:

Dictionary with the ids as the keys and the vectors as the values.

Return type:

dict[str, list[float]]

_get_ids(ids: list[str | int]) → set[str]

For the given list of ids, return the ids that are already in the database.

Parameters:

ids (list[Union[str, int]]) – The ids of the vectors to retrieve. For ChromaDB instances, these should always be strings.

Returns:

Set of ids that are already in the database.

Return type:

set[str]
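One plausible use of a set of already-present ids is to skip duplicates before an insert. The sketch below shows that filtering step in isolation. Whether the class uses _get_ids this way is an assumption; `split_new_and_existing` is a hypothetical helper, and `existing_ids` stands in for the set that _get_ids returns.

```python
def split_new_and_existing(ids, existing_ids):
    """Partition ids into those not yet stored and those already stored.

    existing_ids plays the role of the set returned by _get_ids.
    Hypothetical helper for illustration.
    """
    # Normalize to strings first, since ChromaDB ids are strings.
    str_ids = [str(i) for i in ids]
    new = [i for i in str_ids if i not in existing_ids]
    already = [i for i in str_ids if i in existing_ids]
    return new, already


new, already = split_new_and_existing([1, "2", 3], existing_ids={"2"})
print(new, already)  # ['1', '3'] ['2']
```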