O(n log log n) time integer sorting
The fastest sorting algorithm?
Which sorting algorithm is the fastest? Ask this question to any group of programmers and you’ll get an animated discussion. Of course, there is no one answer. It depends not only on the algorithm, but also on the computer, data, and implementation. However, if you count the number of operations needed to sort integer numbers on a standard von Neumann computer, there is a clear winner – the algorithm presented in the paper “Sorting In Linear Time?” by A. Andersson, T. Hagerup, S. Nilsson, and R. Raman (Proceedings of the 27th Annual ACM Symposium on the Theory of Computing, 1995). It sorts n integers in time proportional to n log log n. In this article, I’ll give you a complete description of this algorithm.
Can it be done even faster? No one knows. We only know that it can’t possibly be done using fewer than n operations: An algorithm using fewer operations than that can’t look at each of the n numbers and, therefore, might leave some of the numbers out of order.
Even though the n log log n sorting algorithm came about as a theoretical game, its real-life performance is good. A straightforward C implementation (nloglogn.c), with no particular optimizations, runs faster on a typical 32-bit machine than many standard textbook sorting algorithms.
The problem
To achieve agreement when discussing the speed of a sorting algorithm, it’s necessary to establish some common ground. How do we measure speed? There is a well-established model – the von Neumann computer, or unit-cost RAM model – widely used by computer scientists to evaluate algorithms without introducing all the complexities of real-world computers. This model doesn’t account for some important things, such as the performance gap between RAM and CPU, but typically gives good approximations of real-life performance.
There is nothing strange about the unit-cost RAM model. It’s probably similar to your mental image of a computer. The computer has a RAM divided into words of w bits – the “word length” of the machine. Each word has an address, and a word can be accessed in constant time given its address. The computer has a CPU capable of performing a small number of instructions, such as reading and writing words in RAM, and performing basic arithmetical and logical operations. Each of these operations can be performed in a constant number of machine cycles.
If you don’t use the full power of the RAM model, but only consider sorting algorithms that operate by comparing elements pairwise, it’s well known that at least n log n comparisons are needed to sort n elements in the worst case. There are plenty of algorithms, such as mergesort and heapsort, that achieve this bound.
Other sorting techniques may be faster. For example, using radix sorting, you can sort n word-sized integers by scanning the data a few bits at a time. Each scan can be done in linear time, and the number of scans depends on the word length w. Hence, the time for radix sorting is proportional to nw. Here, it’s easy to make a mistake. On most computers, w is a small number (typically 32 or 64), and you might be tempted to state that radix sorting is a linear-time sorting algorithm. Not so. The algorithm will only work if w ≥ log n. If not, you wouldn’t even be able to store the numbers: Since each memory address is a word consisting of w bits, the address space won’t accommodate n numbers if w < log n. Hence, in the unit-cost RAM model, radix sorting requires time proportional to nw, which is at least n log n.
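To make the radix-sorting argument concrete, here is a sketch of a standard LSD radix sort (not code from the paper) that sorts 32-bit unsigned integers by scanning eight bits at a time; with w = 32 it makes four linear-time counting-sort passes, in line with the nw bound.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sort n 32-bit unsigned integers with four byte-wide counting-sort passes. */
void radix_sort(uint32_t *a, size_t n) {
    uint32_t *buf = malloc(n * sizeof *buf);
    if (buf == NULL) return;                        /* give up if allocation fails */
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        for (size_t i = 0; i < n; i++)              /* histogram of the current byte */
            count[((a[i] >> shift) & 0xFF) + 1]++;
        for (int b = 0; b < 256; b++)               /* prefix sums: bucket start positions */
            count[b + 1] += count[b];
        for (size_t i = 0; i < n; i++)              /* stable scatter into the buffer */
            buf[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, buf, n * sizeof *a);
    }
    free(buf);
}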
I’ll now describe an algorithm that does not depend on the word length. For any word length w, w ≥ log n, it sorts n word-sized integers in time proportional to n log log n.
The algorithm
With this algorithm, the sorting is performed in two phases. In the first phase, you reduce the size of the numbers using radix sorting techniques. The elements become smaller and the number of elements doesn’t increase. In fact, in linear time, it’s possible to cut the number of bits of each number in half.
In the second phase, you perform a mergesort on the shortened numbers. Because the numbers are now much shorter, several numbers will fit in a single machine word. With several numbers in one word, it’s possible to do several operations on these numbers in parallel using the shift and bitwise AND/OR/XOR operations found on most computers. In fact, using a scheme known as “packed merging,” you can merge two sorted sequences of k numbers, each of which fits in one word, in time proportional to log k. This is an improvement on the standard merging algorithm, which requires time proportional to k.
Given these building blocks, we are almost done. In the first phase, you perform log log n reductions. This is done in n log log n time, because each reduction takes linear time. After halving the number of bits log log n times, each number occupies only w/log n bits, so k = log n of them fit within one w-bit word.
In the second phase, you use the packed merging as a subroutine in a standard mergesort. You perform the merging bottom up. Start by sorting subsequences of length k. Each of the n/k subsequences is sorted in k log k time using a standard comparison-based algorithm, such as mergesort, for a total of n/k × k log k = n log k. Plugging in k = log n, you see that this adds up to n log log n.
You now combine these sorted sequences pairwise using packed merging, and thereby create longer sorted sequences. Each sequence is packed in a single machine word. In this first round, you need to merge n/2k pairs; each merge is done in log k time using packed merging. This sums up to n/2k × log k.
In the next round, you need to merge n/4k pairs, each consisting of two words. This is done in n/4k × 2 log k = n/2k × log k time, the same time bound as before. In fact, you’ll get the same result for each round.
Because the length of the sequences doubles in each round, the number of rounds is log(n/k). Hence, the total time for this version of mergesort is log(n/k) × n/2k × log k. Plugging in k = log n, this is at most log n × n/(2 log n) × log log n = n/2 × log log n, so this expression is also no more than n log log n.
At this point, the data is sorted and no more than n log log n time has elapsed, since both the reduction phase and the mergesorting phase run in time proportional to n log log n.
Reducing number size
There are several alternative algorithms that can be used for reducing a sorting problem to one with shorter numbers. Which to choose depends on whether you want to optimize the rate of reduction, the speed of the algorithm, or the amount of memory required. Here, I’ll show you a simple algorithm that reduces the number of bits by a factor of 2. It runs in linear time, but uses plenty of memory.
Assume that you want to sort n b-bit numbers. I’ll use a bucketing scheme with two bucket tables: High and Low. Each table entry can hold a linked list of numbers. I’ll explain the algorithm using a small example. In this figure, you see seven 6-bit numbers to be sorted.
Data: 100101, 011001, 100111, 001100, 011111, 110111, 001000

High                        Low
------------------------    ------------------------
000:                        000:
001: 001100, 001000         001:
010:                        010:
011: 011001, 011111         011:
100: 100101, 100111         100:
101:                        101:
110: 110111                 110:
111:                        111:

Batch: 100, 011, 001, 110
The figure also shows the two bucket tables and a batch list used to record which buckets are actually being used. This is necessary because you cannot afford to inspect every single bucket: The number of buckets may be much larger than the number of elements.
The first step is to perform bucketing on the high-order bits; in this example, the first 3 bits of each number. Every time a bucket is used for the first time, the number of this bucket is recorded in the batch list.
The next figure shows the second step of the algorithm, where you traverse each non-empty bucket in the High bucket table, moving elements into the Low bucket table. There are several important things to note.
High                        Low
------------------------    --------------------
000:                        000:
001: 001000                 001:
010:                        010:
011: 011001                 011:
100: 100101                 100: 001100
101:                        101:
110: 110111                 110:
111:                        111: 100111, 011111

Batch: 100, 011, 001, 110, 111, 100
- You use the batch to find the nonempty buckets. In this way, the whole operation can be performed in linear time.
- When a Low bucket is used for the first time, the bucket is recorded in the batch.
- You don’t move all of the elements. The minimum element in each High bucket is left behind. This is crucial. The batch will be our reduced sorting problem and it must not contain more than n elements. By leaving one element behind in each nonempty High bucket, you ensure that the total number of nonempty buckets, and hence the number of entries in the batch, is at most n.
The next step is to sort the batch. This is the reduced sorting problem – the batch contains at most n numbers and each number has b/2 bits.
Using the sorted batch, it’s straightforward to assemble the numbers in ascending order. You start by moving numbers back from the Low bucket table to the High bucket table. This figure shows the result.
High                        Low
------------------------    --------
000:                        000:
001: 001000, 001100         001:
010:                        010:
011: 011001, 011111         011:
100: 100101, 100111         100:
101:                        101:
110: 110111                 110:
111:                        111:

Batch: 001, 011, 100, 100, 110, 111
Once again, you use the batch to find the nonempty buckets. But because the batch is now sorted, the Low buckets will be visited in order. Hence, each High bucket will end up containing a sorted list.
Finally, you traverse the nonempty High buckets, once again using the sorted batch, to produce the final sorted list.
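To make the reduction concrete, here is a small, self-contained C sketch of one reduction step on the seven 6-bit numbers from the figures. It is only illustrative: the bucket tables are zero-initialized and the batch is sorted with qsort, whereas the real algorithm avoids the initialization (more on that next) and sorts the batch recursively.

#include <stdio.h>
#include <stdlib.h>

#define N       7
#define HALF    3                /* b/2 bits; the numbers have b = 6 bits */
#define BUCKETS (1 << HALF)

static unsigned high_part(unsigned x) { return x >> HALF; }
static unsigned low_part(unsigned x)  { return x & (BUCKETS - 1); }

static int cmp(const void *a, const void *b) {
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

int main(void) {
    /* 100101, 011001, 100111, 001100, 011111, 110111, 001000 */
    unsigned data[N] = {37, 25, 39, 12, 31, 55, 8};
    unsigned High[BUCKETS][N], Low[BUCKETS][N], batch[2 * N];
    int nhigh[BUCKETS] = {0}, nlow[BUCKETS] = {0}, nbatch = 0;

    /* Step 1: bucket on the high-order bits, recording each bucket's first use. */
    for (int i = 0; i < N; i++) {
        unsigned h = high_part(data[i]);
        if (nhigh[h] == 0) batch[nbatch++] = h;
        High[h][nhigh[h]++] = data[i];
    }

    /* Step 2: move all but the minimum of each nonempty High bucket to Low. */
    int used = nbatch;
    for (int j = 0; j < used; j++) {
        unsigned h = batch[j];
        int min = 0;
        for (int i = 1; i < nhigh[h]; i++)
            if (High[h][i] < High[h][min]) min = i;
        for (int i = 0; i < nhigh[h]; i++) {
            if (i == min) continue;
            unsigned l = low_part(High[h][i]);
            if (nlow[l] == 0) batch[nbatch++] = l;
            Low[l][nlow[l]++] = High[h][i];
        }
        High[h][0] = High[h][min];              /* the minimum stays behind */
        nhigh[h] = 1;
    }

    /* Step 3: sort the batch -- the reduced, 3-bit sorting problem. */
    qsort(batch, nbatch, sizeof batch[0], cmp);

    /* Step 4: visit Low buckets in sorted batch order and move elements back,
       so every High bucket ends up sorted. */
    for (int j = 0; j < nbatch; j++) {
        unsigned l = batch[j];
        for (int i = 0; i < nlow[l]; i++) {
            unsigned h = high_part(Low[l][i]);
            High[h][nhigh[h]++] = Low[l][i];
        }
        nlow[l] = 0;
    }

    /* Step 5: read out the nonempty High buckets, again in batch order. */
    for (int j = 0; j < nbatch; j++) {
        unsigned h = batch[j];
        for (int i = 0; i < nhigh[h]; i++) printf("%u ", High[h][i]);
        nhigh[h] = 0;
    }
    printf("\n");                               /* prints 8 12 25 31 37 39 55 */
    return 0;
}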
You might have noticed a problem with this algorithm. How do you know if a bucket is empty? The obvious solution is to initialize each bucket to a null pointer. But you can’t afford that: The number of buckets might be larger than n and the algorithm is supposed to run in linear time. The solution is a bit tricky; see How to Avoid Initializing Memory.
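The sidebar is not reproduced here, but the standard textbook trick (which I assume is what the sidebar describes) keeps a stack of the bucket numbers that have actually been touched: an entry counts as initialized only if its back-pointer leads to a stack slot that points back at it, so garbage in untouched entries is harmless. A minimal sketch, with illustrative names:

#include <stdlib.h>

typedef struct {
    int *from;    /* from[b]: claimed position of bucket b in used[] (may be garbage) */
    int *used;    /* stack of bucket numbers that really have been touched */
    int count;    /* number of buckets touched so far */
} sparse_flags;

/* Allocate without initializing: no O(number-of-buckets) start-up cost.
   (Strictly speaking, reading an indeterminate value is murky in the C standard;
   the classic trick assumes it simply yields some arbitrary int.) */
static sparse_flags make_flags(int nbuckets) {
    sparse_flags t;
    t.from  = malloc(nbuckets * sizeof *t.from);
    t.used  = malloc(nbuckets * sizeof *t.used);
    t.count = 0;
    return t;
}

static int is_used(const sparse_flags *t, int b) {
    int p = t->from[b];                       /* possibly uninitialized garbage */
    return p >= 0 && p < t->count && t->used[p] == b;
}

static void mark_used(sparse_flags *t, int b) {
    if (!is_used(t, b)) {
        t->from[b] = t->count;
        t->used[t->count++] = b;
    }
}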
Fast merging of short numbers
All that remains is to merge short sequences of short numbers fast. The idea is to use a parallel algorithm that only requires very simple operations. If the algorithm is simple enough, it may be possible to simulate the parallelism using the inherent parallel nature of bitwise logical operations. For example, the bitwise OR operation performs w OR operations in parallel in just one machine cycle.
Batcher’s bitonic merging fulfills the requirements. The algorithm is based on a property of so-called “bitonic sequences.” A bitonic sequence is almost sorted: It is the concatenation of an ascending and a descending sequence, or it can be obtained by a rotation (cyclic shift) of such a sequence. For example, the sequence 2, 3, 4, 3, 2, 1 is bitonic, since it is the concatenation of the ascending sequence 2, 3, 4 and the descending sequence 3, 2, 1. The sequence 3, 4, 3, 2, 1, 2 is also bitonic, because it can be obtained by rotating the previous bitonic sequence. Bitonic sequences have a special property described in this lemma:
Lemma: If the sequence x1, x2, ..., x2k is bitonic, the two sequences

L = min(x1, xk+1), min(x2, xk+2), …, min(xk, x2k)
and
R = max(x1, xk+1), max(x2, xk+2), …, max(xk, x2k)
are also bitonic. Furthermore, each element of L is smaller than or equal to each element of R.
In other words:
- Start with any bitonic sequence with an even number of elements, cut it in half, and form two new sequences by comparing the two halves element by element.
- Form one sequence, L, containing the minimum elements of each pair, and another, R, which contains the maximum elements.
- Then the sequences L and R will both be bitonic and each element of L will be smaller than or equal to each element of R.
Using this lemma, you can efficiently merge sorted sequences. For instance, by reversing one of the two sorted sequences in this figure and appending the other, you get a bitonic sequence:
X = 0, 2, 3, 5
Y = 2, 3, 6, 7

// Form a bitonic sequence S by reversing X and appending Y:
S = 5, 3, 2, 0, 2, 3, 6, 7

// Compute the min and max sequences L and R:
L = min(5, 2), min(3, 3), min(2, 6), min(0, 7) = 2, 3, 2, 0
R = max(5, 2), max(3, 3), max(2, 6), max(0, 7) = 5, 3, 6, 7
In the first phase of the merging, you use the lemma to produce two subsequences L and R, where each element of L is no larger than each element of R. To complete the merging, you only need to sort L and R and concatenate the two sequences. This can be done using the lemma once again: Both L and R are bitonic, so in the second phase, you can apply the bitonic lemma to each of them. The process is repeated recursively until the subsequences consist of just one element.
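As a concrete illustration of this recursive scheme (a plain-array sketch, not the packed version used by the algorithm), the following C function merges X = 0, 2, 3, 5 and Y = 2, 3, 6, 7 by reversing X, appending Y, and applying the lemma at each level:

#include <stdio.h>

/* Merge a bitonic sequence a[0..n-1], n a power of two, into ascending order. */
static void bitonic_merge(int *a, int n) {
    if (n < 2) return;
    int k = n / 2;
    for (int i = 0; i < k; i++)                 /* build L (minima) and R (maxima) */
        if (a[i] > a[i + k]) {
            int t = a[i]; a[i] = a[i + k]; a[i + k] = t;
        }
    bitonic_merge(a, k);                        /* L is bitonic: recurse */
    bitonic_merge(a + k, k);                    /* R is bitonic: recurse */
}

int main(void) {
    int s[8] = {5, 3, 2, 0, 2, 3, 6, 7};        /* X reversed, then Y */
    bitonic_merge(s, 8);
    for (int i = 0; i < 8; i++) printf("%d ", s[i]);
    printf("\n");                               /* prints 0 2 2 3 3 5 6 7 */
    return 0;
}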
Packing all of the numbers in one machine word, it’s possible to perform each phase of this merging algorithm in constant time.
A = 0101 0011 0010 0000 0000 0000 0000 0000   // 5 3 2 0 0 0 0 0
B = 0010 0011 0110 0111 0000 0000 0000 0000   // 2 3 6 7 0 0 0 0

// Compute a bitmask N, showing which elements of A are
// smaller than or equal to the corresponding elements of B.
// M is a precomputed constant.

M = 1000 1000 1000 1000 1000 1000 1000 1000   // 8 8 8 8 8 8 8 8
N = ((B OR M) - A) AND M                      // 0 8 8 8 8 8 8 8
N = N - (N>>3)                                // 0 7 7 7 7 7 7 7

// Compute the max and min sequence and concatenate them.

Z = (B AND N)                                 // 0 3 6 7 0 0 0 0
 OR (A - (A AND N))                           // 5 0 0 0 0 0 0 0
 OR ((A AND N)>>16)                           // 0 0 0 0 0 3 2 0
 OR ((B - (B AND N))>>16)                     // 0 0 0 0 2 0 0 0
In this example, the numbers consist of 3 bits. You’ll also need a special test bit for each number, so you can fit eight numbers into one 32-bit word. The special bitmask N, which indicates which elements are smaller, is computed using subtraction and the basic bitwise AND/OR operations. The second phase of the merging can be performed in a similar fashion. In fact, both subproblems can be solved in parallel using the same technique. (The full details can be found in nloglogn.c.)
Each phase of this merging algorithm is implemented using just a constant number of basic operations so the total time to merge k numbers, small enough to fit within one word, is proportional to log k.
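As a final illustration, here is the figure’s single phase written out in C. This is a sketch mirroring the example above, not the actual code in nloglogn.c: eight 4-bit fields per 32-bit word, each field a test bit followed by a 3-bit value, with A holding the first four elements of the bitonic sequence and B the last four.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One phase of packed bitonic merging for eight 4-bit fields per 32-bit word.
   Returns the max fields (R) in the upper half and the min fields (L) in the
   lower half, exactly as in the figure. */
static uint32_t packed_phase(uint32_t A, uint32_t B) {
    const uint32_t M = 0x88888888u;       /* the test bit of every field           */
    uint32_t N = ((B | M) - A) & M;       /* test bit set where B >= A             */
    N -= N >> 3;                          /* spread the test bit into a field mask */
    return (B & N)                        /* max: B where B >= A                   */
         | (A - (A & N))                  /* max: A where A >  B                   */
         | ((A & N) >> 16)                /* min: A where B >= A                   */
         | ((B - (B & N)) >> 16);         /* min: B where A >  B                   */
}

int main(void) {
    uint32_t A = 0x53200000u;             /* 5 3 2 0  0 0 0 0 */
    uint32_t B = 0x23670000u;             /* 2 3 6 7  0 0 0 0 */
    printf("%08" PRIx32 "\n", packed_phase(A, B));   /* 53672320: 5 3 6 7  2 3 2 0 */
    return 0;
}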
In Dr. Dobb’s Journal, pp. 38-45, Vol. 311, April 01, 2000.
Further reading
See Quicksort optimizations explained for a fast and simple Quicksort implementation.