data_structures.hashing.bloom_filter

See https://en.wikipedia.org/wiki/Bloom_filter

The use of this data structure is to test membership in a set. Compared to Python’s built-in set() it is more space-efficient. In the following example, only 8 bits of memory will be used: >>> bloom = Bloom(size=8)

Initially, the filter contains all zeros: >>> bloom.bitstring ‘00000000’

When an element is added, two bits are set to 1 since there are 2 hash functions in this implementation: >>> “Titanic” in bloom False >>> bloom.add(“Titanic”) >>> bloom.bitstring ‘01100000’ >>> “Titanic” in bloom True

However, sometimes only one bit is added because both hash functions return the same value >>> bloom.add(“Avatar”) >>> “Avatar” in bloom True >>> bloom.format_hash(“Avatar”) ‘00000100’ >>> bloom.bitstring ‘01100100’

Not added elements should return False … >>> not_present_films = (“The Godfather”, “Interstellar”, “Parasite”, “Pulp Fiction”) >>> { … film: bloom.format_hash(film) for film in not_present_films … } # doctest: +NORMALIZE_WHITESPACE {‘The Godfather’: ‘00000101’,

‘Interstellar’: ‘00000011’, ‘Parasite’: ‘00010010’, ‘Pulp Fiction’: ‘10000100’}

>>> any(film in bloom for film in not_present_films)
False

but sometimes there are false positives: >>> “Ratatouille” in bloom True >>> bloom.format_hash(“Ratatouille”) ‘01100000’

The probability increases with the number of elements added. The probability decreases with the number of bits in the bitarray. >>> bloom.estimated_error_rate 0.140625 >>> bloom.add(“The Godfather”) >>> bloom.estimated_error_rate 0.25 >>> bloom.bitstring ‘01100101’

Attributes

HASH_FUNCTIONS

Classes

Bloom

Module Contents

class data_structures.hashing.bloom_filter.Bloom(size: int = 8)
__contains__(other: str) bool
add(value: str) None
exists(value: str) bool
format_bin(bitarray: int) str
format_hash(value: str) str
hash_(value: str) int
bitarray = 0
property bitstring: str
property estimated_error_rate: float
size = 8
data_structures.hashing.bloom_filter.HASH_FUNCTIONS