Cite

[1] M. Aumüller, T. Christiani, R. Pagh, and M. Vesterli, "PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors", Jun. 28, 2019, arXiv: arXiv:1906.12211. doi: 10.48550/arXiv.1906.12211

Synthesis

Abstract

We present PUFFINN, a parameterless LSH-based index for solving the k-nearest neighbor problem with probabilistic guarantees. By parameterless we mean that the user is only required to specify the amount of memory the index is supposed to use and the result quality that should be achieved. The index combines several heuristic ideas known in the literature. By small adaptations to the query algorithm, we make these heuristics rigorous. We perform experiments on real-world and synthetic inputs to evaluate implementation choices and show that the implementation satisfies the quality guarantees while being competitive with other state-of-the-art approaches to nearest neighbor search. We describe a novel synthetic data set that is difficult to solve for almost all existing nearest neighbor search approaches, and for which PUFFINN significantly outperforms previous methods.

Notes

Faster distance computations with sketching

Produce 1-bit sketches for efficient similarity estimation. (The keyword is estimation, not exact computation.)

IDEA (to find more concrete on other papers):

  • use a random hash function to hash the output of an LSH function to a single bit
  • pack bits into a w-bit machine word
  • use word parallelism (XOR + popcount) or table lookups to count the number of collisions between two w-bit sketches

Why hash to a single bit?

A matching bit indicates that the two items were likely mapped to the same bucket by the underlying LSH function, meaning it is probable that they are similar. (Two items in different buckets still produce a matching bit with probability 1/2, which the estimate has to account for.)

It is also very space-efficient, and comparisons reduce to cheap bitwise operations.

What is a machine word?

A hardware word, meaning 32 or 64 bits. This enables word parallelism with bitwise operations.

Why is it fast?

It allows estimating a similarity in a couple of instructions: XOR two sketch words and popcount the result, comparing all w bit positions at once.
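A minimal sketch of the idea in Python (my own illustration, not the paper's code; the Euclidean LSH family, the bucket width BUCKET_W, and the multiply-shift 1-bit hash are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
DIM, BUCKET_W, WORD = 32, 4.0, 64   # data dim, LSH bucket width, bits per word
MASK = (1 << 64) - 1

# One Euclidean-LSH function per sketch bit: h_i(x) = floor((a_i . x + b_i) / BUCKET_W)
A = rng.standard_normal((WORD, DIM))
B = rng.uniform(0, BUCKET_W, size=WORD)
# One random odd multiplier per bit: a multiply-shift hash of h_i(x) to a single bit
C = [int(c) | 1 for c in rng.integers(0, 2**64, size=WORD, dtype=np.uint64)]

def sketch(x):
    """Pack WORD one-bit hashes of the LSH values into one integer (a machine word)."""
    hs = np.floor((A @ x + B) / BUCKET_W).astype(np.int64)
    word = 0
    for c, h in zip(C, hs):
        bit = ((c * (int(h) & MASK)) & MASK) >> 63   # top bit of (c * h mod 2^64)
        word = (word << 1) | bit
    return word

def matching_bits(s1, s2):
    """Compare all WORD positions at once: XOR + popcount (word parallelism)."""
    return WORD - bin(s1 ^ s2).count("1")

x = rng.standard_normal(DIM)
y = x + 0.1 * rng.standard_normal(DIM)   # a point close to x
z = rng.standard_normal(DIM)             # an unrelated point
print(matching_bits(sketch(x), sketch(y)), ">", matching_bits(sketch(x), sketch(z)))
```

Since two distinct LSH values still agree on their random bit with probability 1/2, a matching-bit count m over w bits corresponds to roughly 2m − w actual collisions; an implementation has to debias the estimate accordingly.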

Index Data Structure

trie built on the hash codes of the data points

variant of the LSH Forest

constructed by recursively splitting the set of points S on the next character of the hash code until the maximum depth K is reached or the subset becomes small enough. At that point, we have a leaf that stores references to the points in that subset.
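A minimal construction sketch, under my own assumptions (fixed-length codes and a LEAF_SIZE threshold; the paper's exact leaf condition and memory layout may differ):

```python
# Assumption: codes[p] is a fixed-length tuple of "characters"
# (the K concatenated LSH values of point p).
K, LEAF_SIZE = 8, 4

def build_trie(point_ids, codes, depth=0):
    """Recursively split on the next character of the hash code."""
    if depth == K or len(point_ids) <= LEAF_SIZE:
        return list(point_ids)            # leaf: references to the points
    children = {}
    for p in point_ids:
        children.setdefault(codes[p][depth], []).append(p)
    return {char: build_trie(ids, codes, depth + 1) for char, ids in children.items()}
```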

k-NN definition

Given a dataset S ⊆ X and an integer k, the k-nearest neighbor problem is to build a data structure such that for every query q ∈ X, the query algorithm returns a set of k distinct points, each one being, with probability at least δ, among the k points in S closest to q.

This guarantees an expected recall of at least δ.
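To make the guarantee concrete, a small helper (my illustration, not from the paper) that measures recall against brute-force ground truth; averaged over many queries, the returned sets should reach recall at least δ:

```python
import numpy as np

def true_knn_ids(S, q, k):
    """Brute-force ground truth: ids of the k points of S closest to q."""
    return set(np.argsort(np.linalg.norm(S - q, axis=1))[:k])

def recall(returned_ids, S, q, k):
    """Fraction of the true k nearest neighbors that the index returned."""
    return len(set(returned_ids) & true_knn_ids(S, q, k)) / k
```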

Locality-sensitive hashing framework

An LSH family H is a family of functions h: X → R such that for each pair of points and a random h ∈ H, collisions are more likely for closer pairs: for arbitrary x, y, x', y' ∈ X, whenever dist(x, y) ≤ dist(x', y') we have:

Pr[h(x) = h(y)] ≥ Pr[h(x') = h(y')]

An LSH function maps a data point to a hash code, with the property that closer points are more likely to collide.

Concatenate K LSH functions to make collisions rarer: both collision probabilities are raised to the power K, but the smaller one (for far-away points) shrinks much faster, so far points are filtered out far more aggressively than close ones.

K is a function of the number of points in the dataset, the approximation factor c, and the strength of the hash family.

From K and the hash family, the guarantee of finding a close point with constant probability can be computed. Other methods do not have this theoretical guarantee!
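As a concrete example (my choice of family, not necessarily the one used in the paper), SimHash for angular distance: one random hyperplane gives one bit, and concatenating K of them amplifies the close/far gap:

```python
import numpy as np

rng = np.random.default_rng(7)
DIM, K = 64, 8                              # K concatenated LSH functions

planes = rng.standard_normal((K, DIM))      # one random hyperplane per position

def lsh_code(x):
    """Concatenation of K one-bit SimHash functions -> a length-K hash code."""
    return tuple(int(b) for b in (planes @ x >= 0))

# One SimHash bit collides with probability p = 1 - angle(x, y) / pi, so the
# whole code collides with probability p**K: far pairs almost never share a code.
x = rng.standard_normal(DIM)
y = x + 0.2 * rng.standard_normal(DIM)      # small angle to x
print(lsh_code(x) == lsh_code(y), lsh_code(x) == lsh_code(rng.standard_normal(DIM)))
```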

Query Algorithm

define:

S_{j,i} as the subset of points in S that collide with q when we consider the first i hash values of the j-th trie.

The algorithm is as follows:

Start with the buckets at the maximum depth K and explore them in all tries. Once that is done, move up to the next (shallower) level. Track the top-k closest points seen so far with a priority queue PQ.

Stop the search once we have explored enough tries that the current candidates would have been found with probability at least δ.

Lemma 3 states that the algorithm cannot terminate until j tries have been searched at level i where either:

  • (1 − p(dist(q, x_k))^i)^j ≤ 1 − δ, where x_k is the current k-th closest candidate; with this, the probability of not finding a kNN is at most 1 − δ
  • or i = 0, in which case every point collides with q on the empty prefix and the search has been exhaustive.
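A sketch of the search loop with this stopping rule (my simplification; tries[j].bucket(q, i), dist, and the collision-probability function p are assumed helpers, not the paper's API):

```python
import heapq

def query(q, tries, K, L, k, delta, dist, p):
    """Level-by-level search over L tries of depth K with the Lemma 3 stop test."""
    pq = []                                   # max-heap of (-distance, point id)
    for i in range(K, -1, -1):                # deepest level first, up to the root
        for j in range(L):
            # Assumed helper: ids of points sharing q's first i hash values in trie j.
            for pid in tries[j].bucket(q, i):
                d = dist(q, pid)
                if len(pq) < k:
                    heapq.heappush(pq, (-d, pid))
                elif d < -pq[0][0]:
                    heapq.heapreplace(pq, (-d, pid))
            # Lemma 3 stopping rule: the current k-th closest candidate would have
            # collided with q in one of the j+1 tries at level i with prob >= delta.
            if len(pq) == k and (1 - p(-pq[0][0]) ** i) ** (j + 1) <= 1 - delta:
                return [pid for _, pid in pq]
    return [pid for _, pid in pq]             # at i = 0 every point was a candidate
```

Note that at i = 0 the test always passes once k candidates exist, matching the second case of the lemma: the empty prefix makes every point a candidate.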

Annotations