UPDATE: I cover a flaw in this implementation in a follow-up post.
In a previous post on this blog, I showed how I was able to solve sorting with the bitmap tree, using minimal memory at query time. However, that approach only covered sorting over stored fields. Elasticsearch does not only allow sorting over stored fields; in fact, its ability to do so is almost incidental, since it is really designed for sorting by relevance score, which is computed by a function that evaluates similarity at query time.
For my use, which is much more database-oriented, sorting over stored fields will be the majority usage, relegating relevance sorting to the sidelines. But there is a problem. While direct relevance sorting is not used much in Philomena (essentially only in the "related images" system), it is also used indirectly for random sorting and random navigation, which are used extensively. The current Elasticsearch implementation computes a pseudo-random function which takes the image's ID and a random seed as input and generates a random float as output. The results are then sorted with heapsort to collect the top K documents and merged at the collector.
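To make the cost concrete, here is a rough sketch of that style of top-K random scoring. This is not Elasticsearch's actual code; seeded_score is an arbitrary stand-in mixer, and the names are illustrative:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Stand-in for the seeded pseudo-random scoring function
fn seeded_score(id: u64, seed: u64) -> u64 {
    // Arbitrary 64-bit mixer, illustrative only
    let mut x = id ^ seed;
    x ^= x >> 33;
    x = x.wrapping_mul(0xff51afd7ed558ccd);
    x ^= x >> 33;
    x
}

// Select the k highest-scoring documents: every document in the
// result set must be scored, so the cost scales with n, not with k
fn top_k_random(ids: &[u64], seed: u64, k: usize) -> Vec<u64> {
    let mut heap = BinaryHeap::with_capacity(k + 1);
    for &id in ids {
        heap.push(Reverse((seeded_score(id, seed), id)));
        if heap.len() > k {
            heap.pop(); // evict the current lowest score
        }
    }
    heap.into_sorted_vec()
        .into_iter()
        .map(|Reverse((_score, id))| id)
        .collect()
}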
While this would almost certainly be faster in a Rust implementation due to performance advantages over Java, it would still destroy the main advantage of the stored field sorting: minimal CPU and memory usage, even with extremely large sets. Elasticsearch can parallelize the work by distributing it across shards, but I do not want to design a distributed search engine, so I need a different approach.
The sort could be kludged by computing a hash ahead of time and storing it, but this doesn't allow seeding the randomness. Another potential approach is to generate random indices into the image ID array at query time and only load those, but this introduces the possibility of collisions, and doesn't allow paging deeply into the result set for free the way the bitmap tree does for stored field sorting.
This problem can be reduced to finding some function \(F\) with these properties:
- \(F\) is a bijection over the integers, from \([0..n)\) to \([0..n)\)
  - \(n\) is the cardinality of the set we want a random subset of
  - It must be a bijection because \(F\) has to map every index in the set to a new location
  - To get outputs to return to the user, we simply evaluate \(F(0)\), \(F(1)\), \(F(2)\), ... and select the documents at the indices \(F\) returns
  - To offset the search results by (e.g.) 25, we can evaluate \(F(25)\), \(F(26)\), \(F(27)\), ...; since \(F\) is a bijection, none of the results from \(F(0)\) through \(F(24)\) will appear again (see the example after this list)
- \(F\)'s output is pseudorandom
- \(F\) is seedable and can be influenced to output different permutations
  - Otherwise, there is no advantage over random sorting using stored fields
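For illustration, suppose \(n = 5\) and some seed yields the permutation \(F(0)=3\), \(F(1)=0\), \(F(2)=4\), \(F(3)=1\), \(F(4)=2\) (a hypothetical permutation, not output of the actual cipher). The first page of three results is documents 3, 0, and 4; paging in at offset 3 evaluates \(F(3)\) and \(F(4)\), yielding documents 1 and 2, with nothing from the first page repeated.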
As it turns out, functions matching the description of \(F\) already exist in cryptography. This specific problem maps neatly onto format-preserving encryption, which is mostly mentioned for its use in encrypting things like telephone or credit card numbers in a database column that cannot accommodate large AES ciphertexts. Since format-preserving encryption is not really necessary in any sane database design, documentation on using it is relatively sparse. Since I am not using it for anything security-sensitive, I decided to implement a format-preserving encryption scheme myself.
A key observation for format-preserving encryption is that any existing encryption algorithm, like AES with its 128-bit blocks, can be made to encrypt a number from \([0..n)\) to \([0..n)\) by simply re-encrypting the previous output with the same key until the algorithm generates an output in that range. This technique is called cycle-walking, and it works for any input. However, since the output of AES ranges over \([0..2^{128})\), and the largest sets we plan to work with hold about \(2^{22}\) elements, each cycle-walk would require an average of \(2^{128}/2^{22} = 2^{106}\) chained invocations of AES, which is infeasible.
A smaller cipher is needed. Ideally, we will use a cipher with the minimum number of bits needed to represent all values in \([0..n)\): since \(2^{\text{bits}} < 2n\) in that case, each evaluation lands in the domain with probability greater than \(1/2\), so the expected number of iterations is less than two and the odds of a long cycle-walk shrink exponentially. While a few smaller ciphers exist, most of them are fixed-size, and so would burn needless time on evaluations of the cipher function to find an output fitting in the range. One cipher, the Hasty Pudding cipher, is notable for having no fixed bit length, but it is complicated to implement and provides a level of security this application does not need.
Conveniently, there is an existing framework for generating arbitrary block ciphers called a Feistel network. A Feistel network splits its input into two parts and uses several rounds of a keyed pseudorandom function to progressively scramble the bits (the PRF need not be reversible). Since my application is not security-sensitive, I can use any function for this purpose. After some basic testing, I settled on the FNV-1a hash function, for its relatively good output uniformity, its extreme speed, and its simple implementation:
// 4-round unbalanced Feistel network using FNV-1a as the round PRF
// and no key schedule: the same key is used for each round.
// Requires left_bits <= right_bits so each round remains a bijection.
fn feistel_evaluate(input: u64, left_bits: u32, right_bits: u32, key: u64) -> u64 {
    let left_mask = (1u64 << left_bits) - 1;
    let right_mask = (1u64 << right_bits) - 1;
    let mut output = input;
    for _ in 0..4 {
        // Split into a left_bits-wide half and a right_bits-wide half
        let l = (output >> right_bits) & left_mask;
        let r = output & right_mask;
        // Mix the right half into the left half through the keyed PRF
        let t = (l ^ fnv_1a(r ^ key)) & left_mask;
        // Swap: the old right half becomes the new left half
        output = (r << left_bits) | (t & right_mask);
    }
    output
}
Note that this code is an implementation of an unbalanced Feistel network, where the left and right bit lengths can differ. For some reason, code implementing an unbalanced cipher was surprisingly hard to track down, so I am providing this code for reference.
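The fnv_1a round function is not shown above. For reference, a minimal sketch of 64-bit FNV-1a over the bytes of the input word might look like the following; this is my assumed implementation rather than the exact one used, though the constants are the standard 64-bit FNV offset basis and prime:

// 64-bit FNV-1a over the 8 little-endian bytes of the input word.
// Constants are the standard 64-bit FNV offset basis and prime.
fn fnv_1a(input: u64) -> u64 {
    let mut hash = 0xcbf29ce484222325u64;
    for byte in input.to_le_bytes() {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}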
Finally, there is the cycle-walking algorithm itself, which implements the overall format-preserving encryption:
// Format-preserving encryption via cycle-walking for sets of size n.
// Evaluated over input = 0..n, this yields a permutation of 0..n with
// no repeats or skips; the key selects the permutation.
fn cycle_walking_fpe(input: u64, n: u64, key: u64) -> u64 {
    // Smallest bit width that can represent every value in 0..n
    let bits = 64 - (n - 1).leading_zeros();
    let left_bits = bits >> 1;
    let right_bits = bits - left_bits;
    let mut output = input;
    loop {
        // Re-encrypt until the ciphertext lands inside the domain
        output = feistel_evaluate(output, left_bits, right_bits, key);
        if output < n {
            break output;
        }
    }
}
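To tie it together, here is a usage sketch of fetching one page of a seeded random ordering of a document ID array (the function and parameter names here are illustrative, not part of the original code):

// Fetch one page of a seeded random ordering of doc_ids.
// Deep offsets cost the same as shallow ones: each returned index
// is a single evaluation of the permutation
fn random_page(doc_ids: &[u64], seed: u64, offset: u64, page_size: u64) -> Vec<u64> {
    let n = doc_ids.len() as u64;
    (offset.min(n)..(offset + page_size).min(n))
        .map(|i| doc_ids[cycle_walking_fpe(i, n, seed) as usize])
        .collect()
}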
This algorithm is fast. It is in fact so fast that I initially had trouble benchmarking its components, because the function was executing more quickly than the timer resolution of Cargo's bench system. But after some tweaking and running repeated iterations to get an aggregate, the benchmark results are in: about 40ns per invocation of \(F\) for a set with 2.5 million elements, which is essentially negligible and far faster than sorting using the bitmap tree.
Elasticsearch takes about 300ms to evaluate a random function over the search results and return the top documents. And because it uses heapsort, it doesn't even allow efficient deep paging into the results the way format-preserving encryption does. In this respect, I believe I have met the goals I set out to accomplish.
The source code for this experiment is available here.
Disclaimer
I am not a cryptographer. Using cryptographic techniques to generate pseudorandom permutations is an interesting avenue for amateur exploration, and in this specific case it doesn't pose any risk if the encryption is too weak to provide meaningful security, or has issues such as side-channel attacks. Do not use this as an excuse to implement your own cryptographic primitives; always reuse standard implementations, as they will cover your use case if you need real security. If they don't, think twice before blindly implementing anything you read on the internet: you will probably do it insecurely.