I want to make an autocomplete system for Philomena that enables private search-as-you-type. That is, it never reveals to the server what the user query is. It's obviously possible to do this by sending a list of all tag names to the client, but doing this by sending a JSON array over the wire has some distinct disadvantages:
- Quite large
- For Derpibooru, this would require sending about 10MB of data to every client for tag completion
- Creates tons of garbage on the JS heap
- This is more significant than it may seem at first, because these objects are long-lived and result in every major sweep requiring the browser to mark millions of objects
- Slow enough to drop frames even on fast computers
- JavaScript isn't particularly known for its speed, and this could potentially mean lagging the page until the search results are fully generated
We can do better. Let's refine our requirements a little bit:
- We want a solution that runs in JS, not WebAssembly
- Instead of all 400,000+ tags on Derpibooru, let's limit our database to the top 64K tags, ordered by their use count
- This gets things down to about 2MB, which is more reasonable and less than the size of a typical image
- Like the existing autocompletion, we are interested in prefix matching, not matching anywhere in the tag
- Matching anywhere in the tag would require either bruteforcing or special indexing
- Would be slower and probably not feasible not to drop frames
For prefix matching, the obvious choice is a trie. Unfortunately, that doesn't really work out well for us because it requires scanning results in ways that end up creating more garbage than the next solution, which is a list of tags sorted by their name.
With a list of tags sorted by their name, we can implement the prefix match by binary searching for the first tag name "less than" the prefix and collecting results until encountering the first tag "greater than" the prefix (by numeric codepoint).
Because using JSON for this would require deserializing everything into JS objects, we will ship it as a binary blob. This means the overall structure needs to be designed a bit more carefully to accommodate this:
- String table
- Array of fixed-size tag structs, sorted by name, with a pointer into the string table
- Footer indicating the location of the start of the array
32-bit pointers are used for simplicity - 24-bit pointers would work as well, but are more difficult to emit and parse.
The file format looks like this, in pseudocode:
struct tag_name {
uint8_t name_length;
uint8_t name[];
};
struct tag {
uint32_t tag_name_location;
uint32_t num_uses;
};
struct footer {
uint32_t tags_start;
uint32_t num_tags;
};
struct autocomplete_file {
struct tag_name names[];
struct tag tags[];
struct footer footer;
};