# The art of hashing

Last week I wrote about Bloom filters and gave an example of a simple implementation of that data structure in Python. For it I used three hash functions from Python's hashlib library, which are in fact cryptographic hash functions, so using them that way is overkill. In the meantime I have implemented a Bloom filter in Java, and for that purpose I decided to design my own hash functions. A quick check with my favourite search engine, DuckDuckGo (don't know it yet? Read the Wikipedia article and give it a try, you will not regret it), turned up an interesting article at Eternally Confuzzled about hash functions and their design. As a result, these three pieces of code were brought into existence (JavaDoc omitted):

```java
public static int indexValueHash(String s, int size) {
    int sum = 0;
    char[] sChar = s.toCharArray();

    for (int i = 0; i < sChar.length; i++) {
        if (sChar.length == 1) {
            sum = sChar[i] ^ 1;
        } else {
            sum += (~i << sChar[i]) ^ size ^ 17;
        }
    }

    if (sum < 0) {
        sum >>>= 1;
    }

    return sum % size;
}
```
```java
public static int crossHash(String s, int size) {
    int sum = 0;
    char[] sChar = s.toCharArray();

    for (int i = 0; i < sChar.length; i++) {
        if (i == 0 || i == sChar.length - 1 && sChar.length != 2) {
            sum += sChar[i] * ~3;
        } else {
            sum += (sChar[i] << sChar[sChar.length - i]) ^ ~size;
        }
    }

    if (sum < 0) {
        sum >>>= 1;
    }

    return sum % size;
}
```
```java
// PRIME_NUMBER is a class-level constant (its declaration is omitted here, like the JavaDoc).
public static int primeHash(String s, int size) {
    int sum = 0;
    char[] sChar = s.toCharArray();

    for (int i = 0; i < sChar.length; i++) {
        sum ^= (sChar[i] * PRIME_NUMBER << i);
    }

    if (sum < 0) {
        sum >>>= 1;
    }

    return sum % size;
}
```
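The snippets above assume that class-level `PRIME_NUMBER` constant, whose value I have not stated. A minimal, self-contained usage sketch could look like this (the value 31 and the class name are my choices for illustration, not part of the original code):

```java
import java.util.Arrays;

public class PrimeHashDemo {
    // Assumed value for illustration; any odd prime would do here.
    private static final int PRIME_NUMBER = 31;

    public static int primeHash(String s, int size) {
        int sum = 0;
        char[] sChar = s.toCharArray();

        for (int i = 0; i < sChar.length; i++) {
            sum ^= (sChar[i] * PRIME_NUMBER << i);
        }

        // Force the result to be non-negative before the final reduction.
        if (sum < 0) {
            sum >>>= 1;
        }

        return sum % size;
    }

    public static void main(String[] args) {
        int size = 100000;
        for (String word : Arrays.asList("bloom", "filter", "hash")) {
            System.out.println(word + " -> " + primeHash(word, size));
        }
    }
}
```

Every result lands in the range 0 to size − 1, so it can be used directly as an array index.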

I was basically playing with bitwise operations by trial and error. Then it came to testing. I used two test data sets: a standard English dictionary and a set of random strings (between 10 and 100 characters long, containing letters, digits and punctuation marks), 99171 elements each. The first impression was rather disappointing: about 62000 unique hashes out of over 99000 words is not a particularly impressive result. But to be sure I needed a reference point, so I decided to use some of the hash functions from the article mentioned above. The first trial confirmed my concerns, as the control hashes were almost all unique. However, when I looked at the raw data, it turned out that the set contained many negative numbers, which cannot be used as array indexes, and many values exceeding 100000, the size of the target array. A few little modifications to fix that and, voilà, those algorithms gave results similar to those of my algorithms.
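The fix for the reference hashes can be as simple as clearing the sign bit and reducing modulo the table size. A sketch of that normalization (the class and method names are mine, not from the original code):

```java
public class HashIndex {
    // Clear the sign bit so negative hash values become non-negative,
    // then reduce modulo the table size to get a valid array index.
    public static int toIndex(int hash, int size) {
        return (hash & 0x7fffffff) % size;
    }
}
```

Masking with `0x7fffffff` keeps the low 31 bits instead of discarding a bit of every value the way an unsigned shift would, so it perturbs the distribution less.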
So, now it was time for some more tests. I was wondering whether the hashes are evenly spread or clustered. Since one picture is worth a thousand words, here it is, a scatter plot of one of my hash functions.

Figure 1: A scatter plot of the hashes for the English dictionary

Here is also a bit of statistics:

|                    | simpleHash  | crossHash   | indexValueHash | primeHash   |
|--------------------|-------------|-------------|----------------|-------------|
| Mean               | 49689       | 50325       | 49514          | 50045       |
| Standard Error     | 92.06848    | 91.29671    | 92.40697       | 91.69557    |
| Mode               | 51588       | 25082       | 26707          | 66030       |
| Median             | 49458       | 50367       | 48836          | 50136       |
| First Quartile     | 24416       | 25711       | 24186.5        | 25030       |
| Third Quartile     | 74802       | 75589       | 75068.5        | 75031       |
| Variance           | 840633475   | 826599200   | 846825922      | 833837482   |
| Standard Deviation | 28993.67992 | 28750.63826 | 29100.27357    | 28876.24425 |
| Kurtosis           | -1.20208    | -1.20015    | -1.22006       | -1.20034    |
| Skewness           | 0.01261     | -0.00921    | 0.03620        | -0.00341    |
| Range              | 99998       | 99996       | 99999          | 99998       |
| Minimum            | 1           | 3           | 0              | 1           |
| Maximum            | 99999       | 99999       | 99999          | 99999       |
| Sum                | 4927706291  | 4990732678  | 4910401609     | 4963038133  |
| Count              | 99171       | 99171       | 99171          | 99171       |

I was also curious whether my algorithms would ever give a set of hashes without any collisions, so I ran them in a loop for array sizes between 100000 and Integer.MAX_VALUE (the largest value an int can hold in Java), with a step of 10000. That was quite a task for my computer (an Intel Core i5-4460): it took 2,185 minutes and 51 seconds to finish, but it was worth the time.
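That search loop can be sketched like this (the hash function here is a simplified stand-in for the ones shown earlier, and the real runs used the full 99171-element word sets):

```java
import java.util.HashSet;
import java.util.Set;

public class CollisionSearch {
    // Stand-in hash for illustration; the actual experiment used the
    // crossHash, indexValueHash and primeHash functions shown above.
    static int hash(String s, int size) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum ^= (s.charAt(i) * 31 << i);
        }
        if (sum < 0) {
            sum >>>= 1;
        }
        return sum % size;
    }

    // Grow the table size until every word hashes to a distinct slot,
    // or give up at Integer.MAX_VALUE. Returns -1 if no size worked.
    static int minCollisionFreeSize(String[] words, int start, int step) {
        for (long size = start; size <= Integer.MAX_VALUE; size += step) {
            Set<Integer> seen = new HashSet<>();
            boolean clash = false;
            for (String w : words) {
                if (!seen.add(hash(w, (int) size))) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                return (int) size;
            }
        }
        return -1;
    }
}
```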

|            | simpleHash | crossHash | indexValueHash | primeHash |
|------------|------------|-----------|----------------|-----------|
| dictionary | 98868      | 98914     | 96606          | 99169     |
| random     | 99170      | 99171     | 99171          | 99169     |

For the crossHash function the minimum array size required for collision-free hashing is 715240000, and for indexValueHash it is 728570000. Of course, both only work that way for the random data, as none of my functions achieved this with the dictionary. But at the end of the day they are not that bad for my very first attempt at such algorithms. So, if you like them, feel free to use them (just read the licence ;-)).
