maths.jaccard_similarity

The Jaccard similarity coefficient is a commonly used indicator of the similarity between two sets. Let U be a set and A and B be subsets of U, then the Jaccard index/similarity is defined to be the ratio of the number of elements of their intersection and the number of elements of their union.

Inspired from Wikipedia and the book Mining of Massive Datasets [MMDS 2nd Edition, Chapter 3]

https://en.wikipedia.org/wiki/Jaccard_index https://mmds.org

Jaccard similarity is widely used with MinHashing.

Attributes

set_a

Functions

jaccard_similarity(set_a, set_b[, alternative_union])

Finds the jaccard similarity between two sets.

Module Contents

maths.jaccard_similarity.jaccard_similarity(set_a: set[str] | list[str] | tuple[str], set_b: set[str] | list[str] | tuple[str], alternative_union=False)

Finds the jaccard similarity between two sets. Essentially, its intersection over union.

The alternative way to calculate this is to take union as sum of the number of items in the two sets. This will lead to jaccard similarity of a set with itself be 1/2 instead of 1. [MMDS 2nd Edition, Page 77]

Parameters:
set_a (set,list,tuple):

A non-empty set/list

set_b (set,list,tuple):

A non-empty set/list

alternativeUnion (boolean):

If True, use sum of number of

items as union

Output:

(float) The jaccard similarity between the two sets.

Examples: >>> set_a = {‘a’, ‘b’, ‘c’, ‘d’, ‘e’} >>> set_b = {‘c’, ‘d’, ‘e’, ‘f’, ‘h’, ‘i’} >>> jaccard_similarity(set_a, set_b) 0.375 >>> jaccard_similarity(set_a, set_a) 1.0 >>> jaccard_similarity(set_a, set_a, True) 0.5 >>> set_a = [‘a’, ‘b’, ‘c’, ‘d’, ‘e’] >>> set_b = (‘c’, ‘d’, ‘e’, ‘f’, ‘h’, ‘i’) >>> jaccard_similarity(set_a, set_b) 0.375 >>> set_a = (‘c’, ‘d’, ‘e’, ‘f’, ‘h’, ‘i’) >>> set_b = [‘a’, ‘b’, ‘c’, ‘d’, ‘e’] >>> jaccard_similarity(set_a, set_b) 0.375 >>> set_a = (‘c’, ‘d’, ‘e’, ‘f’, ‘h’, ‘i’) >>> set_b = [‘a’, ‘b’, ‘c’, ‘d’] >>> jaccard_similarity(set_a, set_b, True) 0.2 >>> set_a = {‘a’, ‘b’} >>> set_b = [‘c’, ‘d’] >>> jaccard_similarity(set_a, set_b) Traceback (most recent call last):

ValueError: Set a and b must either both be sets or be either a list or a tuple.

maths.jaccard_similarity.set_a