Cite

[1] M. Ceccarello, A. Pietracaprina, e G. Pucci, «Solving -center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially», 1 giugno 2021, arXiv: arXiv:1802.09205. doi: 10.48550/arXiv.1802.09205.

Synthesis

Contribution::

Strong Points::

Weak Points::

Related::

Abstract

Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular -center variant which, given a set of points from some metric space and a parameter , requires to identify a subset of centers in minimizing the maximum distance of any point of from its closest center. A more general formulation, introduced to deal with noisy datasets, features a further parameter and allows up to points of (outliers) to be disregarded when computing the maximum distance from the centers. We present coreset-based 2-round MapReduce algorithms for the above two formulations of the problem, and a 1-pass Streaming algorithm for the case with outliers. For any fixed , the algorithms yield solutions whose approximation ratios are a mere additive term away from those achievable by the best known polynomial-time sequential algorithms, a result that substantially improves upon the state of the art. Our algorithms are rather simple and adapt to the intrinsic complexity of the dataset, captured by the doubling dimension of the metric space. Specifically, our analysis shows that the algorithms become very space-efficient for the important case of small (constant) . These theoretical results are complemented with a set of experiments on real-world and synthetic datasets of up to over a billion points, which show that our algorithms yield better quality solutions over the state of the art while featuring excellent scalability, and that they also lend themselves to sequential implementations much faster than existing ones. .

Notes

(Composable) Coreset

a coreset is a small subset extracted from the input which has a solution with cost close to the optimal solution of the whole set

composable if coresets are from distinct subsets their union has a close-to-optimal solution

Core Idea

based on the GMM sequential approx. alg. for k-center

GMM
first phase: find k centers in each subset
second phase: compute the solution on the coreset formed by the union of the centers

contribution: select more centers from each subset in the first phase

Doubling dimension

The doubling dimension of is the smallest D such that for any radius and point , all points in the ball of radius centered at are included in the union of at most balls of radius centered at suitable points

it follows that a ball of radius can be covered by at most balls of radius

GMM algorithm

sequential 2-appoximation

How it works:

select an arbitrary point as the center and add it to T. Iteratively, select the next center and add it to T, until T contains k centers

Property

Let . For a given k, let be the outputof GMM when run on X. We have that:

(GMM is a 2-approximation algorithm)

k-center problem

Given a set S of points in the metric space and a positive integer , find a subset of k points (centers) so that the maximum distance between any point of S to its closest center in T is minimized.

outliers can influence a lot the final solution, but there exists a formulation to account for outliers.

MapReduce algorithm for k-center

First round:

  • partition S in subsets of equal size

  • on each subset, run GMM incrementally until where:

    • is the set of i centers selected in the first j iterations
    • is the radius of the set with respect to the first k centers

the coreset will be

Second round:

  • make the union of the coresets and run GMM to compute the final set of k centers

Variant with z outliers

in the first round, we define for each point s its proxy as before, but we add to each a weight (number of points of with proxy t)

in the second round, instead of using GMM we use an algorithm called OutliersCluster to solve a weighted variant of k-center with outliers..

Annotations