Imagine you are creating a user registration module for your website and you need a way of checking that the chosen user name is available. What are your options? Sequential search with its O(n) worst-case run-time complexity? Think about those hundreds of thousands or even millions of user you will have one day! What about binary search? O(log n) sounds better, but still… Then, eureka! Searching for an item in a hash table is just O(1). “However, as the table fills up things become a little more complicated, as rehashing and probing may become necessary” (Dobbyn, 2016). No worries, there is something better – Bloom filters.
A Bloom filter is a data structure that allows holding a huge amount of data in a very limited space and is used for checking if an element is in a set. The way it works is simple. It is an array of bits where initially every bit is set to 0. Every time we want to add an element to the filter, its collection of independent hash functions generates hashes of that element that are turned into integers used as indexes in the bit array of which bits in those positions are changed to 1. To check if an element exists in the set, the same process is used, but this time the filter checks if bits match those in array. Let’s have a look at simple implementation of a Bloom filter and an example of usage to better grasp the idea.
import hashlib def bloom_filter(keyword, bit_array, search = False): hashes = [hashlib.md5(keyword.encode()), hashlib.sha1(keyword.encode()), hashlib.sha224(keyword.encode())] for item in hashes: index = int(item.hexdigest(), 16) % len(bit_array) if search and bit_array[index] != 1: return False else: bit_array[index] = 1 return True my_bloom_filter = *10 bloom_filter('Iggy', my_bloom_filter) bloom_filter('Madeleine', my_bloom_filter) print(bloom_filter('Iggy', my_bloom_filter, True)) print(bloom_filter('Gallagher', my_bloom_filter, True)) print(bloom_filter('Abbey', my_bloom_filter, True))
Here is the output of the above code:
True False True
As can be seen, checking if element added to the filter (Iggy) exists returns True. Also, as expected, checking a keyword that have not been added (Gallagher) returns False, but with increasing number of 1s in the bit array also false positive, that is an error when filter confirms that element exists when it does not, is more likely (Abbey). How the last happens? The reason is the way a Bloom filter stores its data. As can be seen in the Figure 1, for names Iggy and Madeleine the filter sets to 1 bits positioned from 3 to 7. Because of that, any other element we check that will match three of those bits, will be confirmed as existing, even if it does not, as it is in the case of Abbey. To resolve this problem we have to use larger array, because then the 1s will be more scattered.
But the most important is that the opposite situation, that is false negative is impossible (Antognini, 2008). This and its efficiency, where for filter with k hashing functions, both insertion and membership testing are O(k) and because those k lookups are independent and can be parallelized, it makes it incredibly fast (Wikipedia, 2018), Bloom filters have many real-world applications, for example Google Chrome uses it to decide if a particular URL points to a website that is malicious or safe to browse (Yakunin, 2010).
By the way, Figure 1 shows one more important property of a Bloom filter. An element once added cannot be removed from the filter, as there is high probability of affecting some other elements, since more than one of them may share a bit with particular index, as it is in the case of the one with index 7 in our example. Possible solution here seems to be recreating the filter from scratch, excluding that element (Akyldiz, 2016).
Concluding, as far as I am aware there is no other algorithm/data structure but hash table that could compete with Bloom filters in terms of run-time efficiency. But even hash table will loose the competition if we take into consideration space efficiency. A Bloom filter with 1% false positive probability and optimal number of hashing functions needs less than 10 bits per element and the size of the element does not matter, where storing an integer in a hash table requires 32 bits (Srivastav, 2014). But there are trade-offs. Its size needs to be known a priori based on the number of elements that will be stored, which is not always possible to predict. We have already mentioned false positives and impossibility of removing an element once added. It is also impossible to get back a list of stored elements, just check if particular element exists. But still it is very useful data structure.
Dobbyn, Ch. (2016) ‘Hashing and hash tables’, M269 Unit 4: Searching [Online]. Available at https://learn2.open.ac.uk/mod/oucontent/view.php?id=1107848§ion=3 (Accessed 23 May 2018)
Antognini, Ch. (2008) Bloom Filters [Online], Zürich, Trivadis AG. Available at https://antognini.ch/papers/BloomFilters20080620.pdf (Accessed 26 March 2018)
Wikipedia (2018) Bloom filter [Online], 20 March 2018. Available at https://en.wikipedia.org/wiki/Bloom_filter (Accessed 26 March 2018)
Yakunin, A. (2010) ‘Nice Bloom filter application’, Alex Yakunin’s blog, 25 March [Blog]. Available at http://blog.alexyakunin.com/2010/03/nice-bloom-filter-application.html (Accessed 26 March 2018)
Akyldiz, B. (2016) ‘A Gentle Introduction to Bloom Filter’, bugra, 5 June [Blog]. Available at http://bugra.github.io/work/notes/2016-06-05/a-gentle-introduction-to-bloom-filter/ (Accessed 26 March 2018)
Srivastav, P. (2014) ‘Bloom filters for dummies’, Prakhar Srivastav, 19 October [Blog]. Available at https://prakhar.me/articles/bloom-filters-for-dummies/ (Accessed 26 March 2018)