Cite
[1] M. Aumüller e M. Ceccarello, «Solving k-Closest Pairs in High-Dimensional Data», in Similarity Search and Applications: 16th International Conference, SISAP 2023, A Coruña, Spain, October 9–11, 2023, Proceedings, Berlin, Heidelberg: Springer-Verlag, ott. 2023, pp. 200–214. doi: 10.1007/978-3-031-46994-7_17.
Synthesis
Metadata
FirstAuthor:: Aumüller, Martin
Author:: Ceccarello, Matteo
~
Title:: Solving k-Closest Pairs in High-Dimensional Data
Year:: 2023
Citekey:: aumullerSolvingKClosestPairs2023
itemType:: conferencePaper
Publisher:: Springer-Verlag
Location:: Berlin, Heidelberg
Pages:: 200–214
DOI:: 10.1007/978-3-031-46994-7_17
ISBN:: 978-3-031-46993-0
LINK
Abstract
We investigate the k-closest pair problem in high dimensions, that is finding the k≥1 closest pairs of points in a set S⊆X in a metric space (X,dist). This is a fundamental problem in computational geometry with a wide variety of applications, including network science, data mining, databases, and recommender systems. We propose an exact algorithm with a controllable failure probability, thus allowing the user to specify the desired recall. Our algorithm has expected subquadratic running time under mild assumption on the distance distribution, relying only on the existence of a Locality Sensitive Hash family for the metric at hand. We complement our theoretical analysis with an experimental evaluation, showing that our approach can provide solutions orders of magnitude faster than current state-of-the-art data structures designed for specific metrics. .
Notes
k-closest pair problem
Given a set from a metric space , the task is to identify k pairs of distinct points in that are closest to each other
definition of k-closest pairs:
the k-closest pairs in a set are a sequence of distinct pairs such that for all other pairs and for all :
PUFFIN query algorithm
- check each trie j to find the leaf corresponding to the string
-
traverse the selected trie from bottom and keep track of the current kth closest point
-
continue point 2 until the termination criterion
Termination criterion
- current depth in the tries is
- is smaller than the index of the trie being expected
where is the probability of a collision under random choice of the LSH of the points at distance .