The task is:
Given a finite set of objects
pool
, and an objectquery
. Look, which ever comes first, within approximatelyt
time, or at mostn
objects subset ofpool
that are nearquery
according to the metricm
, call that subsetcandidates
.
The implementation of Hierarchical Navigable Small Worlds in datasketch1 gives good enough results within short amount of spent time searching compared to quantization, and the hash algorithm I used2
All in all, it looks like a good skeleton for a very big data base for vectors, called humbly: index.
Remember that we do not want an exact sorting algorithm, not only because 1) the data set is very large, and because 1a) we aim to minimize hardware requirements, and 1b) minimize cost of ownership, it must work with non-volatile memory: hdd, ssd, network filesystems, 2) also because we aim for human latency timings, and interactive use, in other words it should be possible to peek at the computation progress most of the time, under the second, above milliseconds latency, or hyper: anytime!
Navigable small worlds offers a hint graph into the shape of the space described by the vectors. Tho, the goal is not to study the space that is known. Albeit interesting to understand existing knowledge in itself, the ultimate goal is given previously unknown vector, find out what, and how it relates to existing knowledge. Like HNSW publication note, it is possible to improve latency of the search inside what is called layers, and what I call glocals, using other indexing methods. That is giving a clue that NSW can be the backbone of a more complex data model for a vector index.
HNSW is flawed in the sense the graph size, hence the database is limited for given maximum level, and maximum number of neighbors. If we set as requirements an infinite knowledge base, where knowledge is streamed all the time, a knowledge base with an unknown shape, it is more complicated to go go go faint hearted into rewriting datasketch with C…hez Scheme. HNSW over an OKVS can work in out-of-core / distributed / multiprocessing / networked settings, beyond the initial benchmark parameters, it will only scale with skilled administration, that an eventually rewrite, or incrementally update in-place the index using new parameters something that is not supported by source available HNSW implementations at this time.
Soooooooooo, What is the question already: What is the change I need to make to HNSW that helps support {just-in-time, online, incremental} construction with ease that performs well?
…
What I will try next is to implement HNSW re-using logarithmic-based balancing measure to balance trees where nodes are what I call glocals. The tree, sort of b+ tree style, I called hint graph above, will not be sorted but the log measure will guarantee there is not too much weight in a node. The number of vector per glocal, that are node’s children remain an open question to me, and may have to be taken into account by rotation algorithm, considering how costly is every glocal’s index construction. Given a query vector, the tree will be traversed so that eventually it yields the same results as an exact k-nn algorithm, meanwhile it will be possible to peek into the process current best candidates so far, so that other agents can continue.
I hope you have a good time through this serie trying to reverse
engineer one third3 of one of the most interesting world
wide wonder ^W^W^W
invention after the Internet. Let me
know what you think!