data_structures.hashing.bloom_filter¶
See https://en.wikipedia.org/wiki/Bloom_filter
The use of this data structure is to test membership in a set. Compared to Python’s built-in set() it is more space-efficient. In the following example, only 8 bits of memory will be used: >>> bloom = Bloom(size=8)
Initially, the filter contains all zeros: >>> bloom.bitstring ‘00000000’
When an element is added, two bits are set to 1 since there are 2 hash functions in this implementation: >>> “Titanic” in bloom False >>> bloom.add(“Titanic”) >>> bloom.bitstring ‘01100000’ >>> “Titanic” in bloom True
However, sometimes only one bit is added because both hash functions return the same value >>> bloom.add(“Avatar”) >>> “Avatar” in bloom True >>> bloom.format_hash(“Avatar”) ‘00000100’ >>> bloom.bitstring ‘01100100’
Not added elements should return False … >>> not_present_films = (“The Godfather”, “Interstellar”, “Parasite”, “Pulp Fiction”) >>> { … film: bloom.format_hash(film) for film in not_present_films … } # doctest: +NORMALIZE_WHITESPACE {‘The Godfather’: ‘00000101’,
‘Interstellar’: ‘00000011’, ‘Parasite’: ‘00010010’, ‘Pulp Fiction’: ‘10000100’}
>>> any(film in bloom for film in not_present_films)
False
but sometimes there are false positives: >>> “Ratatouille” in bloom True >>> bloom.format_hash(“Ratatouille”) ‘01100000’
The probability increases with the number of elements added. The probability decreases with the number of bits in the bitarray. >>> bloom.estimated_error_rate 0.140625 >>> bloom.add(“The Godfather”) >>> bloom.estimated_error_rate 0.25 >>> bloom.bitstring ‘01100101’
Attributes¶
Classes¶
Module Contents¶
- class data_structures.hashing.bloom_filter.Bloom(size: int = 8)¶
- __contains__(other: str) bool ¶
- add(value: str) None ¶
- exists(value: str) bool ¶
- format_bin(bitarray: int) str ¶
- format_hash(value: str) str ¶
- hash_(value: str) int ¶
- bitarray = 0¶
- property bitstring: str¶
- property estimated_error_rate: float¶
- size¶
- data_structures.hashing.bloom_filter.HASH_FUNCTIONS¶