The art of hashing

Last week I wrote about Bloom filters, giving there an example of simple implementation of that data structure in Python, for which I have used three hash functions from Python hashlib library, that in fact are cryptographic hash functions, so using them is such a way is an overkill. In the meantime I have implemented a Bloom filter in Java and for that purpose decided to design my own hash functions. Quick check using my favourite search engine DuckDuckGo (don’t you know it yet – read this Wikipedia article and give it a try, you will not regret) gave me an interesting article at Eternally Confuzzled about hash functions and their design. As a result those three pieces of code were brought into existence (JavaDoc omitted):

public static int indexValueHash(String s, int size) {
    int sum = 0;
    char[] sChar = s.toCharArray();

    for (int i = 0; i < sChar.length; i++) {
        if (sChar.length == 1) {
            sum = sChar[i] ^ 1;
        } else {
            sum += (~i << sChar[i]) ^ size ^ 17;
        }
    }

    if (sum < 0) {
        sum >>>= 1;
    }

    return sum % size;
}
public static int crossHash(String s, int size) {
    int sum = 0;
    char[] sChar = s.toCharArray();

    for (int i = 0; i < sChar.length; i++) {
        if (i == 0 || i == sChar.length - 1 && sChar.length != 2) {
            sum += sChar[i] * ~3;
        } else {
            sum += (sChar[i] << sChar[sChar.length - i]) ^ ~size;
        } 
    }

    if (sum < 0) {
        sum >>>= 1;
    }

    return sum % size;
}
public static int primeHash(String s, int size) {
    int sum = 0;
    char[] sChar = s.toCharArray();

    for (int i = 0; i < sChar.length; i++) {
        sum ^= (sChar[i] * PRIME_NUMBER << i);
    }

    if (sum < 0) {
        sum >>>= 1;
    }

    return sum % size;
}

I was basically playing with bitwise operations using the trial and error method. And then it came to testing. I was using two testing sets of data: a standard English dictionary and a set of random strings (length between 10 and 100 characters, containing all letters, digits and punctuation marks), 99171 elements each. The first impression was rather disappointing – about 62000 unique hashes out of over 99000 words does not seem to be particularly impressive result. But to be sure I needed some reference point, so decided to use some of the hash functions from the article I was previously referring to. The first trial confirmed my concerns, as the control hashes were almost all unique. However, when I have looked at the raw data it turned out that the set contains many negative numbers that cannot be used as indexes in an array, and also the length of many of them exceeds 100000, that is the size of the target array. A few little modifications to fix the problem and, voilà, those algorithms gave me results similar to those of my algorithms.
So, now it was time for some more test. I was wondering whether those hashes are evenly spread or clustered. Since one picture is worth a thousand words, here it is, a scatter plot of one of my hash functions.

Figure 1: A scatterplot of the hashes for the English dictionary
Figure 1: A scatterplot of the hashes for the English dictionary

Here is also a bit of statistics:

simpleHash crossHash indexValueHash primeHash
Mean 49689 50325 49514 50045
Standard Error 92.06848 91.29671 92.40697 91.69557
Mode 51588 25082 26707 66030
Median 49458 50367 48836 50136
First Quartile 24416 25711 24186.5 25030
Third Quartile 74802 75589 75068.5 75031
Variance 840633475 826599200 846825922 833837482
Standard Deviation 28993.67992 28750.63826 29100.27357 28876.24425
Kurtosis -1.20208 -1.20015 -1.22006 -1.20034
Skewness 0.01261 -0.00921 0.03620 -0.00341
Range 99998 99996 99999 99998
Minimum 1 3 0 1
Maximum 99999 99999 99999 99999
Sum 4927706291 4990732678 4910401609 4963038133
Count 99171 99171 99171 99171

I was also curious if my algorithms ever give me a set of hashes without any collisions, so I have run them in a loop for array sizes between 100000 and the maximum size of integer that Java can handle with step of size 10000. That was quite a task for my computer (Intel Core i5-4460), and it took it 2,185 minutes and 51 seconds to finish it, but it was worth that time.

simpleHash
crossHash
indexValueHash
primeHash
dictionary 98868 98914 96606 99169
random 99170 99171 99171 99169

For the crossHash function the required minimum array size for no collision hashing is 715240000, and in the case of the indexValueHash it is 728570000. Of course both only works that way for random data, as none of my functions achieved this with a dictionary. But, at the end of the day they are not that bad for the very first time I was playing with such algorithms. So, if you like them, feel free to use them (just read the licence ;-)).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.