2. CLANN
Overview
Efficient and practical LSH Family for the Angular Distance.
DEFINITION
Consider the following hash family for points on a unit sphere . Let be a random matrix with i.i.d. Gaussian entries. To hash a point , we compute and then find the point closest to from , where is the i-th standard basis vector of
The collision probability for two points for this hash family function is bound in the following theorem:
THEOREM
Suppose that are such that where . Then:
This theorem shows that the cross-polytope LSH achieves the same bounds as the theoretically optimal LSH for the sphere, in particular a data structure with sub-quadratic space and sublinear query time can be achieved. But this version is not practical, as applying a random rotation takes time proportional to , which is clearly not feasible for large .
To solve this issue, the authors used pseudo-random rotations, applying the transformation , where is the Hadamard transform and is a random diagonal -matrix, with the Fast Hadamard Transform (FHT) which takes time and space . Even this can be too slow for sparse vectors so, before performing the pseudo-random rotation, the dimensionality is reduced from to with feature hashing, taking down evaluation time to .
But even this is not enough, the critical component to make cross-polytope much more efficient than hyperplane LSH is multiprobe, where candidates from multiple cells in each table are considered. Without this modification, cross-polytope performs worse than hyperplane while with multiprobe it has a speed up of nearly .