Last week I wrote about Bloom filters and gave an example of a simple implementation of that data structure in Python, in which I used three hash functions from Python's hashlib library. Those are in fact cryptographic hash functions, so using them in such a way is overkill. In the meantime I have implemented a Bloom filter in Java, and for that purpose I decided to design my own hash functions. A quick check with my favourite search engine, DuckDuckGo (if you don't know it yet, read this Wikipedia article and give it a try; you will not regret it), led me to an interesting article at Eternally Confuzzled about hash functions and their design. As a result, these three pieces of code were brought into existence (JavaDoc omitted):
public static int indexValueHash(String s, int size) {
    int sum = 0;
    char[] sChar = s.toCharArray();
    for (int i = 0; i < sChar.length; i++) {
        if (sChar.length == 1) {
            sum = sChar[i] ^ 1;
        } else {
            sum += (~i << sChar[i]) ^ size ^ 17;
        }
    }
    if (sum < 0) {
        sum >>>= 1;
    }
    return sum % size;
}
public static int crossHash(String s, int size) {
    int sum = 0;
    char[] sChar = s.toCharArray();
    for (int i = 0; i < sChar.length; i++) {
        if (i == 0 || i == sChar.length - 1 && sChar.length != 2) {
            sum += sChar[i] * ~3;
        } else {
            sum += (sChar[i] << sChar[sChar.length - i]) ^ ~size;
        }
    }
    if (sum < 0) {
        sum >>>= 1;
    }
    return sum % size;
}
public static int primeHash(String s, int size) {
    int sum = 0;
    char[] sChar = s.toCharArray();
    for (int i = 0; i < sChar.length; i++) {
        sum ^= (sChar[i] * PRIME_NUMBER << i);
    }
    if (sum < 0) {
        sum >>>= 1;
    }
    return sum % size;
}
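Note that primeHash refers to a PRIME_NUMBER constant that is not shown above. Here is a minimal, self-contained harness for it, assuming PRIME_NUMBER = 31 (the actual constant in my code may differ); it checks that every result is a valid index into an array of the given size:

```java
import java.util.HashSet;
import java.util.Set;

public class PrimeHashDemo {
    // Assumed value; the actual PRIME_NUMBER constant is not shown in the post.
    static final int PRIME_NUMBER = 31;

    public static int primeHash(String s, int size) {
        int sum = 0;
        char[] sChar = s.toCharArray();
        for (int i = 0; i < sChar.length; i++) {
            sum ^= (sChar[i] * PRIME_NUMBER << i);
        }
        if (sum < 0) {
            sum >>>= 1; // clear the sign bit so the modulo result is non-negative
        }
        return sum % size;
    }

    public static void main(String[] args) {
        int size = 100000;
        String[] words = { "bloom", "filter", "hash", "java" };
        Set<Integer> hashes = new HashSet<>();
        for (String w : words) {
            int h = primeHash(w, size);
            // Every hash must be usable as an index into an array of length `size`.
            if (h < 0 || h >= size) throw new AssertionError("hash out of range: " + h);
            hashes.add(h);
        }
        System.out.println(hashes.size() + " unique hashes out of " + words.length + " words");
    }
}
```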
I was basically playing with bitwise operations, using trial and error. And then it came to testing. I used two test data sets: a standard English dictionary and a set of random strings (between 10 and 100 characters long, containing letters, digits and punctuation marks), 99171 elements each. The first impression was rather disappointing: about 62000 unique hashes out of over 99000 words does not seem to be a particularly impressive result. But to be sure I needed some reference point, so I decided to use some of the hash functions from the article I referred to earlier. The first trial confirmed my concerns, as the control hashes were almost all unique. However, when I looked at the raw data, it turned out that the set contained many negative numbers, which cannot be used as indexes in an array, and also many values exceeding 100000, the size of the target array. A few little modifications to fix the problem and, voilà, those algorithms gave results similar to those of my own.
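The kind of modification I mean can be sketched like this. I use the classic FNV-1a hash as a stand-in for the control functions (the exact reference functions from the article are not reproduced here): the raw hash may be negative or larger than the table, so the sign bit is cleared and the value reduced modulo the array size before it is used as an index.

```java
import java.util.HashSet;
import java.util.Set;

public class HashRangeDemo {
    // FNV-1a, a well-known general-purpose string hash, used here only as a
    // stand-in for the reference hash functions from the article.
    static int fnv1a(String s) {
        int hash = 0x811c9dc5;
        for (char c : s.toCharArray()) {
            hash ^= c;
            hash *= 0x01000193;
        }
        return hash; // may be negative, may exceed the target array size
    }

    // The "little modifications": clear the sign bit and reduce modulo size,
    // so the result is always a valid array index in [0, size).
    static int toIndex(int rawHash, int size) {
        return (rawHash & 0x7fffffff) % size;
    }

    public static void main(String[] args) {
        int size = 100000;
        Set<Integer> unique = new HashSet<>();
        String[] sample = { "bloom", "filter", "hash", "function", "java" };
        for (String s : sample) {
            int idx = toIndex(fnv1a(s), size);
            if (idx < 0 || idx >= size) throw new AssertionError("index out of range");
            unique.add(idx);
        }
        System.out.println(unique.size() + " unique indexes for " + sample.length + " words");
    }
}
```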
So, now it was time for some more tests. I was wondering whether those hashes are evenly spread or clustered. Since one picture is worth a thousand words, here it is, a scatter plot of one of my hash functions.
Here are also some statistics:
                     simpleHash    crossHash     indexValueHash  primeHash
Mean                 49689         50325         49514           50045
Standard Error       92.06848      91.29671      92.40697        91.69557
Mode                 51588         25082         26707           66030
Median               49458         50367         48836           50136
First Quartile       24416         25711         24186.5         25030
Third Quartile       74802         75589         75068.5         75031
Variance             840633475     826599200     846825922       833837482
Standard Deviation   28993.67992   28750.63826   29100.27357     28876.24425
Kurtosis             -1.20208      -1.20015      -1.22006        -1.20034
Skewness             0.01261       0.00921       0.03620         0.00341
Range                99998         99996         99999           99998
Minimum              1             3             0               1
Maximum              99999         99999         99999           99999
Sum                  4927706291    4990732678    4910401609      4963038133
Count                99171         99171         99171           99171
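The basic rows of such a table can be reproduced from the raw hash lists with a few lines of Java. This is a minimal sketch of the mean and standard deviation computations (whether the spreadsheet used the population or the sample variance convention is an assumption here; with 99171 values the difference is negligible):

```java
public class HashStats {
    // Arithmetic mean of a list of hash values.
    static double mean(int[] values) {
        double sum = 0;
        for (int v : values) sum += v;
        return sum / values.length;
    }

    // Population standard deviation: sqrt of the mean squared deviation.
    static double stdDev(int[] values) {
        double m = mean(values);
        double sq = 0;
        for (int v : values) sq += (v - m) * (v - m);
        return Math.sqrt(sq / values.length);
    }

    public static void main(String[] args) {
        int[] sample = { 1, 3, 5, 7, 9 };
        System.out.printf("mean=%.1f stddev=%.4f%n", mean(sample), stdDev(sample));
    }
}
```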
I was also curious whether my algorithms could ever give me a set of hashes without any collisions, so I ran them in a loop for array sizes between 100000 and the maximum int value Java can handle, with a step of 10000. That was quite a task for my computer (an Intel Core i5-4460), which took 2,185 minutes and 51 seconds to finish it, but it was worth the time.
             simpleHash   crossHash   indexValueHash   primeHash
dictionary   98868        98914       96606            99169
random       99170        99171       99171            99169
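The search loop itself can be sketched roughly like this (a simplified reconstruction, not my original benchmark code; the word list and the hash function here are placeholders):

```java
import java.util.HashSet;
import java.util.Set;

public class CollisionFreeSearch {
    // Placeholder hash; the real run used the three functions shown earlier.
    static int hash(String s, int size) {
        return (s.hashCode() & 0x7fffffff) % size;
    }

    // Returns the smallest tested array size at which no two words collide,
    // or -1 if none of the tested sizes is collision-free. The `size > 0`
    // check guards against int overflow when stepping toward Integer.MAX_VALUE.
    static int findCollisionFreeSize(String[] words, int start, int max, int step) {
        for (int size = start; size > 0 && size <= max; size += step) {
            Set<Integer> seen = new HashSet<>();
            boolean collision = false;
            for (String w : words) {
                if (!seen.add(hash(w, size))) { // add() is false on a duplicate
                    collision = true;
                    break;
                }
            }
            if (!collision) return size;
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] words = { "alpha", "beta", "gamma", "delta" };
        int size = findCollisionFreeSize(words, 100000, Integer.MAX_VALUE, 10000);
        System.out.println("first collision-free size found: " + size);
    }
}
```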
For the crossHash function the minimum array size required for collision-free hashing is 715240000, and for indexValueHash it is 728570000. Of course, both work that way only for random data, as none of my functions achieved this with the dictionary. But at the end of the day, they are not that bad for my very first attempt at playing with such algorithms. So, if you like them, feel free to use them (just read the licence ;)).