Components

Select Connection: INPUT[inlineListSuggester(optionQuery(#area)):connections]
Date Created: INPUT[dateTime(defaultValue(null)):Date_Created]
Due Date: INPUT[dateTime(defaultValue(null)):Due_Date]
Priority Level: INPUT[inlineSelect(option(1 Critical), option(2 High), option(3 Medium), option(4 Low)):Priority_Level]
Status: INPUT[inlineSelect(option(1 To Do), option(2 In Progress), option(3 Testing), option(4 Completed), option(5 Blocked)):Status]

Description

Let $P$ be the dataset, $q \in X$ a query, $k$ the number of neighbors to find, $\delta$ the success probability (or expected recall), and $x_s \in X$ a generic point of the dataset. The modified PUFFINN algorithm returns a set $PQ$ of at most $k$ points such that, with probability at least $\delta$, it contains points $x \in P$ such that $\mathrm{dist}(q, x) \leq \min\{\mathrm{dist}(q, x_k), \mathrm{dist}(q, x_s)\}$, where $x_k$ is the true $k$-th nearest neighbor.

Proof

Let $x_1, \ldots, x_k$ be the $k$ nearest neighbors of $q$, with $k \geq 1$, let $x_k'$ denote the point currently in $PQ$ with the highest distance from $q$, and let $p(q, x)$ be the collision probability of $q$ and $x$. At each stage of the algorithm, the invariant "$PQ.\mathrm{size} < k$ or $p(q, x_k') \leq p(q, x_k)$" is satisfied even with the modification, because the new probability stopping condition either accelerates termination or defaults to the original guarantee.

Let's prove the original invariant by induction.

Base case: $PQ$ is empty, so the condition $PQ.\mathrm{size} < k$ holds.

Inductive step: assume that the invariant holds before inserting a new point $x$ into $PQ$. We need to prove that it still holds after the insertion. There are two cases:

(a) $PQ.\mathrm{size} < k$. After the insertion, either $PQ.\mathrm{size}$ is still $< k$ or it becomes exactly $k$. In the latter case, we need to show that $p(q, x_k') \leq p(q, x_k)$.
(b) $PQ.\mathrm{size} = k$. If $x$ is not inserted because $\mathrm{dist}(q, x) \geq \mathrm{dist}(q, x_k')$, the invariant still holds. Otherwise, $x$ replaces the previous $x_k'$, since the algorithm replaces the point with the highest distance. We need to show that the new $x_k'$ satisfies $p(q, x_k') \leq p(q, x_k)$.

In both cases it remains to prove that $p(q, x_k') \leq p(q, x_k)$ for any new $x_k'$. By construction, for any point $x \in PQ$ it holds that $\mathrm{dist}(q, x) \leq \mathrm{dist}(q, x_k')$. Since $PQ$ contains $k$ points of $P$ and $x_k$ is the true $k$-th nearest neighbor, at least one of those points is no closer to $q$ than $x_k$, so we have that:

$$\mathrm{dist}(q, x_k') \geq \mathrm{dist}(q, x_k)$$

From the definition and the monotonicity of the LSH family, we can conclude that $\mathrm{dist}(q, x_k') \geq \mathrm{dist}(q, x_k)$ implies $p(q, x_k') \leq p(q, x_k)$, thus concluding the proof of the invariant.
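The invariant can also be checked empirically. The following is a minimal sketch, not the PUFFINN implementation: it assumes one-dimensional points with absolute difference as the distance, models $PQ$ as a bounded max-heap, and inserts each dataset point at most once. The names `insert_candidate`, `invariant_holds`, and `run_demo` are hypothetical. The check is written in terms of distances, which is equivalent to the statement on collision probabilities by the monotonicity of the LSH family.

```python
import heapq
import random


def insert_candidate(pq, k, dist, point_id):
    """Insert a candidate into PQ, keeping only the k closest points.

    PQ is a max-heap on distance, stored as (-dist, point_id) tuples
    because heapq provides a min-heap.
    """
    if len(pq) < k:
        heapq.heappush(pq, (-dist, point_id))
    elif dist < -pq[0][0]:
        # The candidate is closer than the current farthest point x_k',
        # so it replaces the point with the highest distance.
        heapq.heapreplace(pq, (-dist, point_id))


def invariant_holds(pq, k, true_kth_dist):
    """PQ.size < k, or dist(q, x_k') >= dist(q, x_k)."""
    return len(pq) < k or -pq[0][0] >= true_kth_dist


def run_demo(n=200, k=5, seed=0):
    """Insert every point once, in random order, checking the invariant."""
    rng = random.Random(seed)
    points = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    q = 0.0
    dists = [abs(x - q) for x in points]
    true_kth_dist = sorted(dists)[k - 1]  # dist(q, x_k)

    order = list(range(n))
    rng.shuffle(order)  # stands in for the order in which candidates are found

    pq = []
    for idx in order:
        insert_candidate(pq, k, dists[idx], idx)
        if not invariant_holds(pq, k, true_kth_dist):
            return False
    return True


if __name__ == "__main__":
    print("invariant held at every step:", run_demo())
```

The sketch relies on each point entering $PQ$ at most once; with duplicate insertions the farthest point in $PQ$ could end up closer than $x_k$.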

Now we want to find the minimum number of tries to visit before returning a set of at most $k$ points, each of which is, with probability at least $\delta$, among the true $k$ nearest neighbors or within distance $\mathrm{dist}(q, x_s)$ of $q$.

Two cases may occur:

Case 1: $\mathrm{dist}(q, x_s) > \mathrm{dist}(q, x_k')$ (or, equivalently, $p(q, x_s) < p(q, x_k')$). This case reverts to the original PUFFINN guarantee, since $p_{\mathrm{eff}} = p(q, x_k)$. Suppose we are at level $i$ of the search, where the probability of collision is $p(q, x_k)^i$. If we perform $j$ independent searches at level $i$, the probability that none of them collides with $x_k$ is $(1 - p(q, x_k)^i)^j$. Our aim is to find $j$ that satisfies:

$$(1 - p(q, x_k)^i)^j \leq 1 - \delta$$

$$\begin{aligned}
(1 - p(q, x_k)^i)^j &\leq 1 - \delta \\
\ln\left((1 - p(q, x_k)^i)^j\right) &\leq \ln(1 - \delta) && \text{take the logarithm of both sides} \\
j \cdot \ln(1 - p(q, x_k)^i) &\leq \ln(1 - \delta) && \text{use } \ln a^b = b \ln a \\
j &\geq \frac{\ln(1 - \delta)}{\ln(1 - p(q, x_k)^i)} && \text{isolate } j
\end{aligned}$$

The inequality flips in the last step because $\ln(1 - p(q, x_k)^i)$ is negative.

At this point we can use the Taylor expansion $\ln(1 - x) = -x - x^2/2 - x^3/3 + O(x^4)$ on the denominator. Every term of the expansion is negative for $0 < x < 1$, so $-\ln(1 - x) = x + O(x^2) \geq x$, and we obtain:

$$\frac{\ln(1 - \delta)}{\ln(1 - p(q, x_k)^i)} = \frac{\ln(1/(1 - \delta))}{-\ln(1 - p(q, x_k)^i)} = \frac{\ln(1/(1 - \delta))}{p(q, x_k)^i + O(p(q, x_k)^{2i})} \leq \frac{\ln(1/(1 - \delta))}{p(q, x_k)^i}$$

It is therefore sufficient to choose:

$$j \geq \frac{\ln(1/(1 - \delta))}{p(q, x_k)^i}$$

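The point $x_k$ is not known during the search, so the bound is evaluated on $x_k'$ instead. Since we proved that $\mathrm{dist}(q, x_k') \geq \mathrm{dist}(q, x_k)$ implies $p(q, x_k') \leq p(q, x_k)$, the number of repetitions computed with $x_k'$ is at least the number required for $x_k$, and we can conclude the proof:

$$j \geq \frac{\ln(1/(1 - \delta))}{p(q, x_k')^i} \geq \frac{\ln(1/(1 - \delta))}{p(q, x_k)^i}$$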

Case 2: $\mathrm{dist}(q, x_s) \leq \mathrm{dist}(q, x_k')$ (or, equivalently, $p(q, x_s) \geq p(q, x_k')$). The proof repeats as before, with $x_s$ in place of $x_k$. We obtain that with $j \geq \frac{\ln(1/(1 - \delta))}{p(q, x_s)^i}$ we can guarantee, with probability at least $\delta$, that all $x \in P$ with $p(q, x) \geq p(q, x_s)$ (i.e. $\mathrm{dist}(q, x) \leq \mathrm{dist}(q, x_s)$) are found.
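To make the bound concrete, the sketch below computes the smallest integer $j$ satisfying $j \geq \ln(1/(1 - \delta)) / p^i$ and checks numerically that the resulting failure probability $(1 - p^i)^j$ does not exceed $1 - \delta$. The function names are hypothetical, and `effective_probability` encodes an assumption read off the two cases above: the bound is evaluated on whichever of $p(q, x_k')$ and $p(q, x_s)$ is larger.

```python
import math


def required_repetitions(p, level, delta):
    """Smallest integer j with j >= ln(1 / (1 - delta)) / p**level."""
    if not 0.0 < p <= 1.0:
        raise ValueError("collision probability p must be in (0, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("success probability delta must be in (0, 1)")
    if level < 1:
        raise ValueError("level must be at least 1")
    return math.ceil(math.log(1.0 / (1.0 - delta)) / p ** level)


def failure_probability(p, level, j):
    """Probability that j independent searches at this level all miss."""
    return (1.0 - p ** level) ** j


def effective_probability(p_k_prime, p_s):
    """Assumption: Case 1 uses p(q, x_k'), Case 2 uses p(q, x_s)."""
    return max(p_k_prime, p_s)


if __name__ == "__main__":
    delta, level = 0.9, 3
    p_k_prime, p_s = 0.5, 0.8  # illustrative values, not from the source

    j_original = required_repetitions(p_k_prime, level, delta)
    j_modified = required_repetitions(
        effective_probability(p_k_prime, p_s), level, delta
    )

    print("repetitions with p(q, x_k'):", j_original)
    print("repetitions with the larger probability:", j_modified)
    print("failure probability:", failure_probability(p_k_prime, level, j_original))
```

With these illustrative values the second case needs fewer repetitions, which matches the remark that the modified stopping condition either accelerates termination or falls back to the original guarantee.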
