Index Type Methods

The Index Type specifies the strategy by which data will be distributed in a collection. This influences the data storage volume, speed of search operations, and the search accuracy. Specifically, there is an inherent trade-off between these traits. For example, higher speed of operation leads to reduced accuracy or increased storage volume. Hyperspace supports multiple indexing methods, as explained below.

Brute Force (brute_force)

The brute force indexing method corresponds to precise K-Nearest Neighbors retrieval (KNN ), devoid of approximations. This approach ensures optimal accuracy, at the expense of a reduced search speed.

HNSW (hnsw)

Hierarchical Navigable Small World (HNSW) is an efficient approach for approximated nearest neighbor search in high-dimensional vector spaces. HNSW creates a hierarchical structure that organizes data points as graph nodes, in a manner similar to a small-world network. In this structure, each node has connections to selected nearest neighbors. The hierarchy allows for efficient navigation through the dataset, reducing the computational complexity of the search process.

HNSW offers a good balance between accuracy and search speed in large datasets with high dimensionality.

Parameters

  • 'm' (int, default 30): Specifies the number of arcs per new data point. A larger M value results in higher recall, but larger storage volume and slower search times. Thus, larger values of M are required for datasets that have higher dimensionality and/or require higher recall.

  • 'ef' (int, default 16) – Specifies the size of the dynamic list for nearest neighbors. A larger ef value results in better accuracy but slower search times. Essentially, by setting a larger ef, you're allowing the algorithm to consider more potential neighbors for a better match, but this comes at the cost of longer processing times. ef must be larger than the number of queried nearest neighbors (NN).

  • 'ef_construction' (int, default 360) – It is similar to ef, but used for index creation. Though the upload and index creation may require more time and storage volume, this option provides search outcome with higher recall.

Inverted File Index (ivf)

Inverted file index is a search optimization method that composed of partitioning the datasets and associate each with a centroid. Each database vector is then associated with a partition that corresponds to its nearest centroid. The vector components serve as dimensions in the vector space, and the inverted file index is constructed to map each dimension (or a combination of dimensions) to the documents that have vectors containing them.

Parameters

  • 'nlist' (int, default 128) – Only used for index_type = ivf or bin_ivf. This option is used during index creation and represents the number of buckets used during clustering. A larger nlist leads to quicker search with lower accuracy.

Last updated