machine_learning.similarity_search¶

Similarity Search : https://en.wikipedia.org/wiki/Similarity_search Similarity search is a search algorithm for finding the nearest vector from vectors, used in natural language processing. In this algorithm, it calculates distance with euclidean distance and returns a list containing two data for each vector:

  1. the nearest vector

  2. distance between the vector and the nearest vector (float)

Functions¶

cosine_similarity(→ float)

Calculates cosine similarity between two data.

euclidean(→ float)

Calculates euclidean distance between two data.

similarity_search(→ list[list[list[float] | float]])

Module Contents¶

machine_learning.similarity_search.cosine_similarity(input_a: numpy.ndarray, input_b: numpy.ndarray) → float¶

Calculates cosine similarity between two data. :param input_a: ndarray of first vector. :param input_b: ndarray of second vector. :return: Cosine similarity of input_a and input_b. By using math.sqrt(),

result will be float.

>>> cosine_similarity(np.array([1]), np.array([1]))
1.0
>>> cosine_similarity(np.array([1, 2]), np.array([6, 32]))
0.9615239476408232
machine_learning.similarity_search.euclidean(input_a: numpy.ndarray, input_b: numpy.ndarray) → float¶

Calculates euclidean distance between two data. :param input_a: ndarray of first vector. :param input_b: ndarray of second vector. :return: Euclidean distance of input_a and input_b. By using math.sqrt(),

result will be float.

>>> euclidean(np.array([0]), np.array([1]))
1.0
>>> euclidean(np.array([0, 1]), np.array([1, 1]))
1.0
>>> euclidean(np.array([0, 0, 0]), np.array([0, 0, 1]))
1.0
machine_learning.similarity_search.similarity_search(dataset: numpy.ndarray, value_array: numpy.ndarray) → list[list[list[float] | float]]¶
Parameters:
  • dataset – Set containing the vectors. Should be ndarray.

  • value_array – vector/vectors we want to know the nearest vector from dataset.

Returns:

Result will be a list containing 1. the nearest vector 2. distance from the vector

>>> dataset = np.array([[0], [1], [2]])
>>> value_array = np.array([[0]])
>>> similarity_search(dataset, value_array)
[[[0], 0.0]]
>>> dataset = np.array([[0, 0], [1, 1], [2, 2]])
>>> value_array = np.array([[0, 1]])
>>> similarity_search(dataset, value_array)
[[[0, 0], 1.0]]
>>> dataset = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
>>> value_array = np.array([[0, 0, 1]])
>>> similarity_search(dataset, value_array)
[[[0, 0, 0], 1.0]]
>>> dataset = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
>>> value_array = np.array([[0, 0, 0], [0, 0, 1]])
>>> similarity_search(dataset, value_array)
[[[0, 0, 0], 0.0], [[0, 0, 0], 1.0]]

These are the errors that might occur:

1. If dimensions are different. For example, dataset has 2d array and value_array has 1d array: >>> dataset = np.array([[1]]) >>> value_array = np.array([1]) >>> similarity_search(dataset, value_array) Traceback (most recent call last):

…

ValueError: Wrong input data’s dimensions… dataset : 2, value_array : 1

2. If data’s shapes are different. For example, dataset has shape of (3, 2) and value_array has (2, 3). We are expecting same shapes of two arrays, so it is wrong. >>> dataset = np.array([[0, 0], [1, 1], [2, 2]]) >>> value_array = np.array([[0, 0, 0], [0, 0, 1]]) >>> similarity_search(dataset, value_array) Traceback (most recent call last):

…

ValueError: Wrong input data’s shape… dataset : 2, value_array : 3

3. If data types are different. When trying to compare, we are expecting same types so they should be same. If not, it’ll come up with errors. >>> dataset = np.array([[0, 0], [1, 1], [2, 2]], dtype=np.float32) >>> value_array = np.array([[0, 0], [0, 1]], dtype=np.int32) >>> similarity_search(dataset, value_array) # doctest: +NORMALIZE_WHITESPACE Traceback (most recent call last):

…

TypeError: Input data have different datatype… dataset : float32, value_array : int32

thealgorithms-python

Navigation

index.md

  • Contributing guidelines
  • Getting Started
  • Community Channels
  • List of Algorithms
  • MIT License
  • API Reference
    • maths
    • other
    • sorts
    • graphs
    • hashes
    • matrix
    • ciphers
    • geodesy
    • physics
    • quantum
    • strings
    • fractals
    • geometry
    • graphics
    • knapsack
    • searches
    • financial
    • blockchain
    • scheduling
    • conversions
    • electronics
    • fuzzy_logic
    • backtracking
    • audio_filters
    • file_transfer
    • project_euler
    • greedy_methods
    • linear_algebra
    • neural_network
    • boolean_algebra
    • computer_vision
    • data_structures
    • networking_flow
    • web_programming
    • bit_manipulation
    • data_compression
    • machine_learning
    • cellular_automata
    • genetic_algorithm
    • divide_and_conquer
    • linear_programming
    • dynamic_programming
    • digital_image_processing

Related Topics

  • Documentation overview
    • API Reference
      • machine_learning
        • Previous: machine_learning.sequential_minimum_optimization
        • Next: machine_learning.support_vector_machines
©2014, TheAlgorithms. | Powered by Sphinx 8.2.3 & Alabaster 1.0.0 | Page source