Vector Similarity Metrics

The vector search Metric defines the distance measure between vectors during the vector search. From a mathematical point of view, the K nearest neighbors are those with the smallest distance from the search query vector.

In vector search higher score reflects greater similarity, or smaller distance. Thus, a algebraic conversion from distance to score is required. This relation is given in the Hyperspace Score column.

Metric Distance Measure Hyperspace Score

l2

d(X,Y)=i(xiyi)2d(X, Y) = \sum_{i} (x_i - y_i)^2

11+d\frac{1}{1 + d}

ip

d(X,Y)=ixiyid(X,Y)=-∑_i x _i ⋅y_ i

{11+dif d01dd<0\begin{cases} \frac{1}{1 + d} & \text{if } d \geq 0 \\ 1 - d & \text{d<0} \end{cases}

hamming

d(X,Y)=i=1nδ(xi,yi)d(X,Y)= \sum_{i=1}^{n} \delta(x_i, y_i)

11+d\frac{1}{1 + d}

Euclidean Metric (l2)

The L2L2 metric quantifies the similarity as the distance between two points in a Euclidean space. The metric is derived from the Pythagorean theorem and represents the length of the shortest path between the two points:

L2(X,Y)=(ixiyi2)12L^2(X, Y) = \left( \sum_{i} |x_i - y_i|^2 \right)^{\frac{1}{2}}

Note that the L2L^2 metric is a special case of the LpL^p metric -

Lp(X,Y)=(ixiyip)1pL^p(X, Y) = \left( \sum_{i} |x_i - y_i|^p \right)^{\frac{1}{p}}

Hyperspace uses the squared L2L2 metric for calculation efficiency. This does not affect the order of the candidates.

Inner Product (ip)

The inner product metric quantifies the similarity between two vectors as -1 times the projection of one vector on the other, which is -1 if the vectors are parallel and 0 if they are perpendicular. The minus sign ensures that minimal distance corresponds to maximum similarity. In a Euclidean vector space, the inner product is:

IP(X,Y)=ixiyiIP(X,Y) =-∑_i x _i ⋅y_ i

Vectors must be normalized before using the IP metric.

Hamming Distance (hamming)

The Hamming distance metric quantifies the similarity between two lists by measuring the number of positions at which the corresponding symbols differ. The metric operates over binary strings, and counts the bits that are different between two binary vectors. The lower the Hamming distance, the more similar the strings are considered to be, and vice versa.

The hamming distance is given by

Hamming(X,Y)=i=1nδ(xi,yi)\text{Hamming}(X, Y) = \sum_{i=1}^{n} \delta(x_i, y_i)

where δ(xi,yi)={1if xi=yi0otherwise \delta(x_i, y_i) =\begin{cases} 1 & \text{if } x_i=y_i\\ 0 & \text{otherwise} \end{cases} is the Kronecker delta. The above formula counts the number of characters that are different between XX and YY .

Last updated