machine_learning.similarity_search

Similarity search (https://en.wikipedia.org/wiki/Similarity_search) is a search algorithm for finding the nearest vector in a set of vectors; it is used in natural language processing. This module measures distance with Euclidean distance and, for each query vector, returns a list containing two items:

the nearest vector

the distance between the query vector and the nearest vector (float)

Functions

`cosine_similarity`(→ float)	Calculates the cosine similarity between two vectors.
`euclidean`(→ float)	Calculates the Euclidean distance between two vectors.
`similarity_search`(→ list[list[list[float] | float]])	Finds, for each query vector, the nearest vector in a dataset and the distance to it.

Module Contents

machine_learning.similarity_search.cosine_similarity(input_a: numpy.ndarray, input_b: numpy.ndarray) → float

Calculates the cosine similarity between two vectors.

Parameters:

input_a – ndarray of the first vector.
input_b – ndarray of the second vector.

Returns:

Cosine similarity of input_a and input_b, as a float.

>>> cosine_similarity(np.array([1]), np.array([1]))
1.0
>>> cosine_similarity(np.array([1, 2]), np.array([6, 32]))
0.9615239476408232
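
The definition behind these values is the dot product of the two vectors divided by the product of their Euclidean norms. The following is a minimal sketch of that formula, not the module's own code: the name `cosine_similarity_sketch` is hypothetical, and zero-length vectors (which would cause a division by zero) are deliberately not handled.

```python
import numpy as np


def cosine_similarity_sketch(input_a: np.ndarray, input_b: np.ndarray) -> float:
    """Hypothetical helper: dot(a, b) / (||a|| * ||b||)."""
    # Numerator: sum of element-wise products.
    dot = np.dot(input_a, input_b)
    # Denominator: product of the two Euclidean (L2) norms.
    norms = np.linalg.norm(input_a) * np.linalg.norm(input_b)
    # Convert the NumPy scalar to a plain Python float.
    return float(dot / norms)


if __name__ == "__main__":
    print(cosine_similarity_sketch(np.array([1, 2]), np.array([6, 32])))
```

For the second example above, the dot product is 70 and the norms are sqrt(5) and sqrt(1060), which gives roughly 0.9615.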

machine_learning.similarity_search.euclidean(input_a: numpy.ndarray, input_b: numpy.ndarray) → float

Calculates the Euclidean distance between two vectors.

Parameters:

input_a – ndarray of the first vector.
input_b – ndarray of the second vector.

Returns:

Euclidean distance of input_a and input_b. Because math.sqrt() is used, the result is a float.

>>> euclidean(np.array([0]), np.array([1]))
1.0
>>> euclidean(np.array([0, 1]), np.array([1, 1]))
1.0
>>> euclidean(np.array([0, 0, 0]), np.array([0, 0, 1]))
1.0
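
Euclidean distance is the square root of the sum of squared element-wise differences. The sketch below illustrates that calculation with `math.sqrt()`, as the description mentions; the name `euclidean_sketch` is hypothetical and the body is an assumption about how such a function could be written, not a copy of the module.

```python
import math

import numpy as np


def euclidean_sketch(input_a: np.ndarray, input_b: np.ndarray) -> float:
    """Hypothetical helper: sqrt of the sum of squared differences."""
    # Pair up the components, square each difference, and add them.
    squared_sum = sum((a - b) ** 2 for a, b in zip(input_a, input_b))
    # math.sqrt always returns a built-in Python float.
    return math.sqrt(squared_sum)


if __name__ == "__main__":
    print(euclidean_sketch(np.array([0, 0]), np.array([3, 4])))
```

Both inputs are assumed to be one-dimensional and of equal length; `zip` would silently ignore trailing components of the longer vector otherwise.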

machine_learning.similarity_search.similarity_search(dataset: numpy.ndarray, value_array: numpy.ndarray) → list[list[list[float] | float]]

Parameters:

dataset – Set containing the vectors to search. Should be an ndarray.
value_array – The vector or vectors whose nearest vector in dataset we want to find.

Returns:

A list with one entry per vector in value_array. Each entry contains (1) the nearest vector from dataset and (2) the distance to that vector.

>>> dataset = np.array([[0], [1], [2]])
>>> value_array = np.array([[0]])
>>> similarity_search(dataset, value_array)
[[[0], 0.0]]

>>> dataset = np.array([[0, 0], [1, 1], [2, 2]])
>>> value_array = np.array([[0, 1]])
>>> similarity_search(dataset, value_array)
[[[0, 0], 1.0]]

>>> dataset = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
>>> value_array = np.array([[0, 0, 1]])
>>> similarity_search(dataset, value_array)
[[[0, 0, 0], 1.0]]

>>> dataset = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
>>> value_array = np.array([[0, 0, 0], [0, 0, 1]])
>>> similarity_search(dataset, value_array)
[[[0, 0, 0], 0.0], [[0, 0, 0], 1.0]]
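
The behaviour shown in these examples amounts to a brute-force nearest-neighbour search: for each query vector, compute the distance to every dataset vector and keep the smallest. The following is a minimal sketch under that assumption. The name `nearest_vectors` is hypothetical, the vectorised use of `np.linalg.norm` is a choice made for this illustration, and no input validation is performed.

```python
import numpy as np


def nearest_vectors(dataset: np.ndarray, value_array: np.ndarray) -> list:
    """Hypothetical brute-force search returning [nearest_vector, distance] pairs."""
    answer = []
    for value in value_array:
        # Euclidean distance from this query to every row of the dataset.
        distances = np.linalg.norm(dataset - value, axis=1)
        # Index of the closest row; on ties, argmin keeps the first one.
        nearest = int(np.argmin(distances))
        answer.append([dataset[nearest].tolist(), float(distances[nearest])])
    return answer


if __name__ == "__main__":
    data = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
    queries = np.array([[0, 0, 0], [0, 0, 1]])
    print(nearest_vectors(data, queries))
```

This approach costs one distance computation per (query, dataset vector) pair, which is fine for small datasets like the ones in the examples.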

These are the errors that might occur:

1. If dimensions are different. For example, dataset is a 2D array and value_array is a 1D array:

>>> dataset = np.array([[1]])
>>> value_array = np.array([1])
>>> similarity_search(dataset, value_array)
Traceback (most recent call last):
    ...
ValueError: Wrong input data's dimensions... dataset : 2, value_array : 1

2. If the vectors have different lengths. For example, dataset has shape (3, 2) and value_array has shape (2, 3). The number of rows may differ, but every vector in both arrays must have the same number of components, so this input is rejected:

>>> dataset = np.array([[0, 0], [1, 1], [2, 2]])
>>> value_array = np.array([[0, 0, 0], [0, 0, 1]])
>>> similarity_search(dataset, value_array)
Traceback (most recent call last):
    ...
ValueError: Wrong input data's shape... dataset : 2, value_array : 3

3. If data types are different. Both arrays are expected to have the same dtype; if they do not, a TypeError is raised:

>>> dataset = np.array([[0, 0], [1, 1], [2, 2]], dtype=np.float32)
>>> value_array = np.array([[0, 0], [0, 1]], dtype=np.int32)
>>> similarity_search(dataset, value_array)  # doctest: +NORMALIZE_WHITESPACE
Traceback (most recent call last):
    ...
TypeError: Input data have different datatype... dataset : float32, value_array : int32
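
The three checks above can be sketched as a standalone validator. This is an illustration only: the function name `check_inputs` is hypothetical, the order of the checks is an assumption, and the messages are modelled on the ones shown in the tracebacks rather than taken from the module's code.

```python
import numpy as np


def check_inputs(dataset: np.ndarray, value_array: np.ndarray) -> None:
    """Hypothetical validator mirroring the three error cases listed above."""
    # Case 1: both arrays must have the same number of dimensions.
    if dataset.ndim != value_array.ndim:
        raise ValueError(
            "Wrong input data's dimensions... "
            f"dataset : {dataset.ndim}, value_array : {value_array.ndim}"
        )
    # Case 2: for 2D inputs, every vector must have the same length.
    if dataset.ndim == 2 and dataset.shape[1] != value_array.shape[1]:
        raise ValueError(
            "Wrong input data's shape... "
            f"dataset : {dataset.shape[1]}, value_array : {value_array.shape[1]}"
        )
    # Case 3: both arrays must share one dtype.
    if dataset.dtype != value_array.dtype:
        raise TypeError(
            "Input data have different datatype... "
            f"dataset : {dataset.dtype}, value_array : {value_array.dtype}"
        )
```

Calling `check_inputs` before a search lets invalid input fail early with a descriptive message instead of producing a confusing error inside the distance computation.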
