Components

Select Connection: INPUT[inlineListSuggester(optionQuery(#area)):connections] Date Created: INPUT[dateTime(defaultValue(null)):Date_Created] Due Date: INPUT[dateTime(defaultValue(null)):Due_Date] Priority Level: INPUT[inlineSelect(option(1 Critical), option(2 High), option(3 Medium), option(4 Low)):Priority_Level] Status: INPUT[inlineSelect(option(1 To Do), option(2 In Progress), option(3 Testing), option(4 Completed), option(5 Blocked)):Status]

Description

Let where is the maximum radius for any cluster and is the th true nearest neighbor distance. If the dataset belongs to a metric space of doubling dimension then, with probability , we have that the query algorithm terminates with expected running time:

proof

The query algorithm defined in Algorithm \ref{} returns the true nearest neighbors with probability , hence the time complexity result we derive holds with probability . At iteration of the algorithm, denote by as the farthest point (in terms of distance to ) stored in the priority queue PQ and let

where is the radius of the cluster and is the maximum radius of all clusters. Initially, at , PQ is empty so and . As the algorithm proceeds in the next iterations, decreases incrementally until it converges to the true -nearest neighbor distance . To ensure all points within distance of are found, CLANN must examine all clusters that intersect the -ball centered at . For cluster to intersect the -ball it must hold that: Rearranging, we obtain:

Therefore, every cluster contributing to the final -nearest neighbors must lie within a ball of radius centered at .

Given that the dataset belongs to a metric space of doubling dimension , we can apply the doubling space property to bound the number of relevant clusters. This property allows us to recursively cover each ball of radius with balls of half the radius , then each of those with balls of radius and so on. At each scale the number of clusters needed is . Summing over all scales until we get the total number of potentially relevant clusters:

Define , then:

when (i.e., when the maximum cluster radius is significantly larger than the distance to the -th nearest neighbor), this simplifies to . Bounded the number of relevant clusters, we can conclude the running time proof.

For each of the clusters, the algorithm performs two main operations:

  • A call to the PUFFINN search algorithm, which returns the candidate nearest neighbors in expected time with probability . This time is linked to the optimal expected running time of an algorithm that knows optimal parameter choices for the number of tries and depth of each one.
  • At most priority queue insertions, where each insertion requires time for a total of per cluster. Additionally, the initial sorting of the clusters is performed in time . Thus, the overall expected running time is:

Under typical assumptions where the sorting and term are dominated by the PUFFINN search cost (which is typically true for large datasets), this expression simplifies asymptotically to:

concluding the proof.