Bloom filters explained

yourbasic.org

A Bloom filter is a probabilistic data structure used to test set membership.

Bloom filter basics
Example: Blocking shady websites
Implementation
Performance

Bloom filter basics

The Bloom filter data structure tells whether an element may be in a set, or definitely isn’t. The only possible errors are false positives: a search for a nonexistent element can give an incorrect answer. With more elements in the filter, the error rate increases.

Bloom filters are both fast and space-efficient. However, elements can only be added, not removed.

Content distribution networks use them to avoid caching one-hit wonders, files that are seen only once. Web browsers use them to check for potentially harmful URLs.

Example: Blocking shady websites

This piece of code, which uses the Go Bloom filter implementation in package github.com/yourbasic/bloom, shows a typical Bloom filter use case.

// A Bloom filter with room for at most 100000 elements.
// The error rate for the filter is less than 1/200.
blocklist := bloom.New(10000, 200)

// Add an element to the filter.
url := "https://rascal.com"
blocklist.Add(url) 

// Test for membership.
if blocklist.Test(url) {
	fmt.Println(url, "may be blocked.")
} else {
	fmt.Println(url, "has not been blocked.")
}

https://rascal.com may be blocked.

Implementation

An empty Bloom filter is a bit array of m bits, all set to 0. There are also k different hash functions, each of which maps a set element to one of the m bit positions.

To add an element, feed it to the hash functions to get k bit positions, and set the bits at these positions to 1.
To test if an element is in the set, feed it to the hash functions to get k bit positions.
- If any of the bits at these positions is 0, the element definitely isn’t the set.
- If all are 1, then the element may be in the set.

Example

Here is a Bloom filter with three elements x, y and z. It consists of 18 bits and uses 3 hash functions. The colored arrows point to the bits that the elements of the set are mapped to.

The element w definitely isn’t in the set, since it hashes to a bit position containing 0.

Try this cool interactive Bloom filter demonstration by Jason Davies!

Performance

For a fixed error rate, adding a new element and testing for membership are both constant time operations, and a filter with room for n elements requires O(n) space.

In fact, a filter with error rate 1/p can be implemented with 0.26⋅ln(p) bytes per element using ⌈1.4⋅ln(p)⌉ bit array lookups per test.

This means that the example code above, with 100,000 entries and error rate 1/200, can be implemented using 135 KiB and 8 hash functions.

The Wikipedia Bloom filter article has a detailed mathematical analysis.