The astute reader will recognize that the hash function returns a positive integer computed from a bag-of-grams, where the grams are slices of words of length between 2 and 7 (a magic choice).
The idea is that simhash will capture the similarity that exists in small-ish strings:
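As a reminder of how such a simhash can be computed, here is a minimal sketch of the classic algorithm, assuming the standard counter-per-bit construction and using md5 as the underlying hash (both assumptions; the original's exact function and hash are not documented here):

```python
import hashlib

def simhash(text, bits=32):
    # Bag-of-grams: contiguous word slices of length 2 to 7,
    # mirroring the "magic" 2..7 window described above (assumption).
    words = text.split()
    grams = [
        " ".join(words[i:i + n])
        for n in range(2, 8)
        for i in range(len(words) - n + 1)
    ]
    counters = [0] * bits
    for gram in grams:
        # Hash each gram, keep the low `bits` bits.
        digest = hashlib.md5(gram.encode()).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for b in range(bits):
            counters[b] += 1 if (h >> b) & 1 else -1
    # Each bit of the simhash is the sign of its counter.
    return sum(1 << b for b in range(bits) if counters[b] > 0)
```

Similar texts share many grams, so their counters, hence their bits, mostly agree.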
Given a giant set of documents and a new document, figuring out which document is the most similar requires computing the simhash of every known document beforehand, at index time; then, at query time, the simhash of the new input document must be compared with the simhashes of ALL the known documents, so the complexity is at least O(n).
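That query-time linear scan can be sketched as follows (`most_similar` and the dict-shaped index are illustrative names, not fuzzbuzz's API):

```python
def hamming(a, b):
    # Number of bit positions where the two simhashes differ.
    return bin(a ^ b).count("1")

def most_similar(query_simhash, index):
    # index: {doc_id: simhash}. One comparison per known
    # document: O(n) in the number of indexed documents.
    return min(index, key=lambda doc_id: hamming(index[doc_id], query_simhash))
```

For example, with `index = {"a": 0b1111, "b": 0b0001}` and a query simhash of `0b0011`, the scan picks `"b"` (Hamming distance 1 versus 2).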
In fuzzbuzz, that size is undocumented magic, but it was chosen to be two times bigger than the count of known documents: 2^32 = 4 294 967 296, that is room for around 2 billion documents. Storing a 32-bit simhash for each of them takes 2 000 000 000 × 32 / 8 = 8 000 000 000 bytes, that is 8GB of memory just for the simhashes. That is way too much for my laptop with 12GB of RAM.
It is not possible to pre-compute (aka. index) the Hamming distances between all possible input documents.
What about indexing, in an Ordered Key-Value store, the simhash instead?
That turns out to be possible.
Given a simhash of 8 bits, one can construct a merkle tree using a binary operator as the hash function, and serialize the resulting tree with a depth-first traversal back to a bit string.
edit: the merkle-tree hash function is binary.
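A sketch of that construction, assuming binary OR as the combining function and a pre-order traversal for the depth-first serialization (both assumptions where the original leaves the details implicit):

```python
def merkle_bits(simhash, bits=8):
    # Leaves are the simhash bits, most significant first.
    leaves = [(simhash >> (bits - 1 - i)) & 1 for i in range(bits)]

    def node(lo, hi):
        # Returns (value, serialization) for the subtree over leaves[lo:hi].
        if hi - lo == 1:
            return leaves[lo], [leaves[lo]]
        mid = (lo + hi) // 2
        lval, lser = node(lo, mid)
        rval, rser = node(mid, hi)
        val = lval | rval  # binary OR as the tree's "hash" function (assumption)
        # Depth-first, pre-order serialization: parent first, then children.
        return val, [val] + lser + rser

    _, ser = node(0, bits)
    return "".join(map(str, ser))
```

For 8 leaves the full binary tree has 15 nodes, so the serialization is a 15-bit string; an internal bit is 1 exactly when some simhash bit below it is 1, which is what makes the prefix useful as an index key.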