Hash tables explained [step-by-step example]

yourbasic.org
file cabinet

Basics

Hash tables are used to implement map and set data structures in most common programming languages. In C++ and Java they are part of the standard libraries, while Python and Go have builtin dictionaries and maps.

A hash table is an unordered collection of key-value pairs, where each key is unique.

Hash tables offer a combination of efficient lookup, insert and delete operations. Neither arrays nor linked lists can achieve this:

Hashing with chaining (simplified example)

The most common hash table implementation uses chaining with linked lists to resolve collisions. This combines the best properties of arrays and linked lists.

hash table
Hash table with names as keys and phone numbers as values.

Let’s start with a somewhat simplified example: a data structure that can store up to 1000 records with random integer keys.

To distribute the data evenly, we use several short lists. All records with keys that end with 000 belong to one list, those with keys that end with 001 belong to another one, and so on. There is a total of 1000 such lists. This structure can be represented as an array of lists:

var table = new LinkedList[1000]

where LinkedList denotes a linked list of key-value pairs.

Inserting a new record (key, value) is a two-step procedure:

hash = key % 1000
table[hash].AddFirst(key, value)

This is a constant time operation.

A lookup is implemented by

value = table[key%1000].Find(key)

Since the numbers are random, there will be roughly the same number of records in each list. Since there are 1000 lists and at most 1000 records, there will likely be very few records in the list table[key%1000] and therefore the lookup operation will be fast.

The average time complexity of both the lookup and insert operations are O(1). Using the same technique, deletion can also be implemented in constant average time.

Realistic hash function example

We want to generalize this basic idea to more complicated keys that aren’t evenly distributed. The number of records in each list must remain small, and the records must be evenly distributed over the lists. To achieve this we just need to change the hash function, the function which selects the list where a key belongs.

The hash function in the example above is hash = key % 1000. It takes a key (a positive integer) as input and produces a number in the interval 0..999.

In general, a hash function is a function from E to 0..size-1, where E is the set of all possible keys, and size is the number of entry points in the hash table. We want this function to be uniform: it should map the expected inputs as evenly as possible over its output range.

Java’s implementation of hash functions for strings is a good example. The hashCode method in the String class computes the value

s[0]·31n-1 + s[1]·31n-2 + … + s[n-1]

using int arithmetic, where s[i] is the i:th character of the string, and n is the length of the string.

This method can be used as a hash function like this:

hash = Math.abs(s.hashCode() % size)

where size is the number of entry points in the hash table.

Note that this function depends on all characters in the string, and that the value changes when we change the order of the characters. Two properties that should hold for a good hash function.

Resizing in constant amortized time

The efficiency of a hash table depends on the fact that the table size is proportional to the number of records. If the number of records is not known in advance, the table must be resized when the lists become too long:

If the table size is increased by a constant factor for each resizing, i.e. by doubling its size, this strategy gives amortized constant time performance for lookups, insertions and deletions.

For more on the performance of this strategy, see Amortized time complexity.

Share this page: