Liam White
Bitmap trees

In our previous discussion about bitmaps, one of the problems we had was inefficient range searches. But let's revisit that idea.

Suppose we have a bitmap for each distinct document value. Let's group a bunch of them together, and store their lower bound, upper bound, and union at the same time:

2019-12-25 → 2019-12-31
2019-12-25000001000000000000000
2019-12-26000000100000000000000
2019-12-27000000010000000000000
2019-12-28000000001000000000000
2019-12-29000000000100000000000
2019-12-30000000000010000000000
2019-12-31000000000001000000000
 000001111111000000000

Now, let's do it again with more of our documents:

2020-01-01 → 2020-01-07
2020-01-01000000000000100000000
2020-01-02000000000000010000000
2020-01-03000000000000001000000
2020-01-04000000000000000100000
2020-01-05000000000000000010000
2020-01-06000000000000000001000
2020-01-07000000000000000000100
 000000000000111111100

You can see that in each case, the big union we stored at index time lets us grab every document in the range at query time between our upper and lower bound. Does this mean we have just solved range searching with bitmaps?

If we make the unions bigger, then we are reducing the number of bitmap operations we have to do during a range query, because we will have to check a smaller number of lower and upper bounds. But if the endpoint of a range query falls inside the middle of a large union, then we will have to scan increasingly more documents. We might not end up better off than a scan over all the documents!

So no, we haven't quite solved range searching with bitmaps yet. We want a solution that can combine the low overhead of using large ranges with the precision of using small ranges.

As with many problems in computer science, we can model the solution with a tree. In our tree, we will have small ranges at the leaves, and use the higher layers to summarize the lower layers, with increasingly large ranges and bitmaps that are the unions of the smaller layers they contain.

2019-12-25 → 2020-01-07
2019-12-25 → 2019-12-31000001111111000000000
2020-01-01 → 2020-01-07000000000000111111100
 000001111111111111100

At any level, if our query entirely includes the range in the current node, then we can just intersect with its bitmap and we will have processed everything beneath it in a single step. We can then advance to the inorder successor, or stop if there isn't one.

If our query endpoint is inside the range given by the current node, then we can enter the node and examine each of the sub-ranges it contains in a recursive fashion.

With this construction, we build a tree not dissimilar to a B+Tree, where all the leaves are at the same level, and each level above the leaf level contains exponentially fewer nodes. This lets us effectively evaluate a range query in O(log n) time, proportional to the number of distinct values.

In the next post, and final in this series, we will adapt this range selection algorithm into code to propagate the top documents.