Sunday, April 6, 2014

CS131: Hash Tables

A hash table (also called map or dictionary) is an abstract collection of items where any data item can be stored and accessed with an associated key. Both data and key can be of any type.

The word hash is used as a term for transformation of a given key.

Direct address tables

If total number of possible items are reasonably small we may use direct address table where key is the index of the data array. Array is a direct address table.

Search()

Insert()

Delete()

Hash function

When direct address table is not a feasible option, for example when possible keys are too big, we use a function that takes the key k and return the index i of the table.

        i = h(k)   [ 0 <= i <= M - where M is total number of buckets in the table]

If the possible number of keys are bigger than available buckets in the hash table two keys may map the same index. This situation is called collision.

Let us consider the following keys

5, 13, 15, 17, 21, 24

We define the hash function as

    h(k) = k mod M
                  5  13
+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+

Fig: The mapping of the keys for M=7

Now for the first five keys (5, 13, 15, 17, 21) we can put the keys in the hash table buckets without any problem. But if we try to insert 24 we find 24 mod 7 = 3 and we have already 17 mapped to the 3rd bucket. We will look at a few techniques to solve the problem.

It is possible to create a very big hash table to minimize the number of collision. But we will be wasting a lot of memory if the table is mostly empty. So we want the average number of elements per bucket, which is called load factor, to be larger.

Load factor = n/M

The bigger the load factor the less wasted space.

Choosing a hash method

A good hash function should minimize collision, be easy to compute and distribute keys evenly through the available buckets.

The division method

h(k) = k mod M

For a good choice of M take a prime number that is distant from the values those are power of 2.

If M = 2^p the function will map two keys with same last character/digit to the same bucket [verify]

If M = 2^p-1 the function maps keys with same set of characters / digits to the same bucket. [verify]

The multiplication method

For all k h(k) = floor(M(kA-(floor(kA)) - where A is a constant with value range from 0 to 1. The value of M is not critical here.

Collision resolution by chaining

When two keys maps to same index we can use a linked list for that key to store multiple values. If there are M buckets we can maintain M linked list each of which may contain zero or more items.

Head(1) [*]-->[Value 1 | * ]-->[Value 2 | $ ]
Head(2) [*]-->[Value 3 | $ ]
Head(3) [*]-->[Value 5 | * ]-->[Value 6 | * ]-->[Value 7 | $ ]
Head(4) [$]
Head(5) [*]-->[Value 8 | $ ]

Fig: Using linked list for collision resolution
I <-- H(k) +1   [1 <= I <= M]
If I is present in HashTable 
    Do
        If k = KEY[I] return I        
        I = Link[I]
    While (I != 0)

Open addressing with Linear probing and insertion

Open addressing with Double hashing

Deletion

We can not simply delete a key from hash table since there could be more than one keys that hashed to the same bucket.

Deletion with linear probing

Universal hashing

For any hash function it is possible that someone will be able to come up with a set of keys such that every key maps to the same bucket or a set of very small number of buckets making the hash table a linked list and accessing any item will take O(n) time instead of expected O(1) time.

To solve this problem we can define a set of different hash functions and pick one function randomly when we initialize the hash table before first use. This technique is called universal hashing and operation on any element is expected to take O(1) average time.

Once chosen the hash function is not changed for the lifetime of the hash table so that each key hashed to the same bucket.

Linear Hashing

The hash table grows or shrinks as items are inserted or deleted from the hash table. It is not related to linear probing.

If N is initially chooses as number of buckets the number of total bucket of the hash table is chosen as 2N, 4N, 8N, 16N etc. that is, power of 2 * N. The power used is called level. So total number of bucket is 2^level * N. Level 2 has 4N buckets.

Linear hashing is implemented using dynamic array, a variable size array that allows random access of keys. When the size of array is changed

Extendible Hashing

Growing the size of hash table requires rehashing the keys and could be big performance hit when it is done. To avoid this situation, extendible hashing technique uses a trie as a hierarchical storage so that the rehashing can be done incrementally, one bucket at a time, as the hash table grows.