machine_learning.similarity_search
Similarity search (https://en.wikipedia.org/wiki/Similarity_search) is a search technique that finds, for a given query vector, the nearest vector in a set of vectors. It is used, for example, in natural language processing. This implementation measures distance with the Euclidean distance and returns a list containing two items for each query vector:
1. the nearest vector
2. the distance between the query vector and its nearest vector (float)
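Functions
- cosine_similarity(input_a, input_b): Calculates the cosine similarity between two vectors.
- euclidean(input_a, input_b): Calculates the Euclidean distance between two vectors.
- similarity_search(dataset, value_array): Finds, for each vector in value_array, the nearest vector in dataset.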
Module Contents
- machine_learning.similarity_search.cosine_similarity(input_a: numpy.ndarray, input_b: numpy.ndarray) -> float
Calculates the cosine similarity between two vectors.
- Parameters:
input_a – ndarray of the first vector.
input_b – ndarray of the second vector.
- Returns:
Cosine similarity of input_a and input_b. By using math.sqrt(), the result will be a float.
>>> cosine_similarity(np.array([1]), np.array([1]))
1.0
>>> cosine_similarity(np.array([1, 2]), np.array([6, 32]))
0.9615239476408232
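The quantity described above is the dot product of the two vectors divided by the product of their lengths. A minimal sketch using only the standard library follows; `cosine_similarity_sketch` is a hypothetical name for illustration and is not claimed to match the module's implementation.

```python
import math
from collections.abc import Sequence


def cosine_similarity_sketch(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Undefined for a zero vector: the division raises ZeroDivisionError.
    return dot / (norm_a * norm_b)


# Same inputs as the second doctest above; the result is about 0.9615.
print(cosine_similarity_sketch([1, 2], [6, 32]))
```

A value of 1.0 means the vectors point in the same direction, regardless of their lengths, which is why cosine similarity and Euclidean distance can rank neighbours differently.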
- machine_learning.similarity_search.euclidean(input_a: numpy.ndarray, input_b: numpy.ndarray) -> float
Calculates the Euclidean distance between two vectors.
- Parameters:
input_a – ndarray of the first vector.
input_b – ndarray of the second vector.
- Returns:
Euclidean distance of input_a and input_b. By using math.sqrt(), the result will be a float.
>>> euclidean(np.array([0]), np.array([1]))
1.0
>>> euclidean(np.array([0, 1]), np.array([1, 1]))
1.0
>>> euclidean(np.array([0, 0, 0]), np.array([0, 0, 1]))
1.0
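The Euclidean distance is the square root of the sum of squared component differences. The sketch below shows one way to compute it with math.sqrt(), as the return description mentions; `euclidean_sketch` is a hypothetical name, and the body is an assumption about the general approach rather than the module's actual code.

```python
import math
from collections.abc import Sequence


def euclidean_sketch(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two equal-length vectors (illustrative helper)."""
    # math.sqrt always returns a Python float, even for integer inputs.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_sketch([0, 0], [3, 4]))  # 5.0 (the 3-4-5 right triangle)
```

On Python 3.8 and later, `math.dist(a, b)` from the standard library computes the same quantity.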
- machine_learning.similarity_search.similarity_search(dataset: numpy.ndarray, value_array: numpy.ndarray) -> list[list[list[float] | float]]
- Parameters:
dataset – Set of vectors to search. Must be an ndarray.
value_array – Vector or vectors for which to find the nearest vector in dataset.
- Returns:
A list with one entry per vector in value_array. Each entry contains:
1. the nearest vector
2. the distance to that vector
>>> dataset = np.array([[0], [1], [2]])
>>> value_array = np.array([[0]])
>>> similarity_search(dataset, value_array)
[[[0], 0.0]]

>>> dataset = np.array([[0, 0], [1, 1], [2, 2]])
>>> value_array = np.array([[0, 1]])
>>> similarity_search(dataset, value_array)
[[[0, 0], 1.0]]

>>> dataset = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
>>> value_array = np.array([[0, 0, 1]])
>>> similarity_search(dataset, value_array)
[[[0, 0, 0], 1.0]]

>>> dataset = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
>>> value_array = np.array([[0, 0, 0], [0, 0, 1]])
>>> similarity_search(dataset, value_array)
[[[0, 0, 0], 0.0], [[0, 0, 0], 1.0]]
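The examples above amount to a brute-force nearest-neighbour search: every query vector is compared with every dataset vector, and the closest one is kept. The following is a rough vectorized sketch of that idea, assuming 2-D inputs with matching vector lengths; `nearest_vectors_sketch` is a hypothetical name, and the use of `np.linalg.norm` and `np.argmin` is an illustrative choice, not necessarily how the module is written.

```python
import numpy as np


def nearest_vectors_sketch(dataset: np.ndarray, queries: np.ndarray) -> list[list]:
    """For each query row, return [nearest dataset row, Euclidean distance]."""
    results = []
    for query in queries:
        # Distances from this query to every dataset row, computed at once.
        distances = np.linalg.norm(dataset - query, axis=1)
        best = int(np.argmin(distances))  # on ties, the first minimum wins
        results.append([dataset[best].tolist(), float(distances[best])])
    return results


dataset = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
queries = np.array([[0, 0, 0], [0, 0, 1]])
print(nearest_vectors_sketch(dataset, queries))
# [[[0, 0, 0], 0.0], [[0, 0, 0], 1.0]]
```

This approach costs one distance computation per (query, dataset vector) pair, which is fine for small datasets; larger ones usually call for an index structure instead.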
These are the errors that might occur:
1. The arrays have different numbers of dimensions. For example, dataset is a 2-D array and value_array is a 1-D array:
>>> dataset = np.array([[1]])
>>> value_array = np.array([1])
>>> similarity_search(dataset, value_array)
Traceback (most recent call last):
...
ValueError: Wrong input data's dimensions... dataset : 2, value_array : 1
2. The vectors have different lengths. For example, dataset has shape (3, 2) and value_array has shape (2, 3). The number of rows may differ, but every vector must have the same number of components, so the second dimensions (here 2 and 3) must match:
>>> dataset = np.array([[0, 0], [1, 1], [2, 2]])
>>> value_array = np.array([[0, 0, 0], [0, 0, 1]])
>>> similarity_search(dataset, value_array)
Traceback (most recent call last):
...
ValueError: Wrong input data's shape... dataset : 2, value_array : 3
3. The data types differ. The two arrays are compared directly, so they must share the same dtype; otherwise a TypeError is raised:
>>> dataset = np.array([[0, 0], [1, 1], [2, 2]], dtype=np.float32)
>>> value_array = np.array([[0, 0], [0, 1]], dtype=np.int32)
>>> similarity_search(dataset, value_array)  # doctest: +NORMALIZE_WHITESPACE
Traceback (most recent call last):
...
TypeError: Input data have different datatype... dataset : float32, value_array : int32
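The three failure modes above can be expressed as a small validation step that runs before any distance is computed. The sketch below is one possible arrangement, assuming the checks happen in the order listed; `check_inputs_sketch` is a hypothetical helper, and its messages only imitate the ones shown in the tracebacks.

```python
import numpy as np


def check_inputs_sketch(dataset: np.ndarray, value_array: np.ndarray) -> None:
    """Raise if the two arrays cannot be compared row by row (illustrative helper)."""
    if dataset.ndim != value_array.ndim:
        raise ValueError(
            "Wrong input data's dimensions... "
            f"dataset : {dataset.ndim}, value_array : {value_array.ndim}"
        )
    # Assumption: both arrays are 2-D from here on, as in the examples above.
    if dataset.shape[1] != value_array.shape[1]:
        raise ValueError(
            "Wrong input data's shape... "
            f"dataset : {dataset.shape[1]}, value_array : {value_array.shape[1]}"
        )
    if dataset.dtype != value_array.dtype:
        raise TypeError(
            "Input data have different datatype... "
            f"dataset : {dataset.dtype}, value_array : {value_array.dtype}"
        )


try:
    check_inputs_sketch(np.array([[1]]), np.array([1]))
except ValueError as err:
    print(err)
```

Checking the number of dimensions first keeps the later `shape[1]` lookup safe when one array is 2-D and the other is 1-D.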