Cite
[1] M. Ceccarello, A. Pietracaprina, and G. Pucci, "Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially", June 1, 2021, arXiv:1802.09205. doi: 10.48550/arXiv.1802.09205.
Synthesis
Contribution::
Strong Points::
Weak Points::
Related::
Metadata
FirstAuthor:: Ceccarello, Matteo
Author:: Pietracaprina, Andrea
Author:: Pucci, Geppino
~
Title:: Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially
Year:: 2021
Citekey:: ceccarelloSolvingcenterClustering2021
itemType:: preprint
DOI:: 10.48550/arXiv.1802.09205
LINK
Abstract
Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-center variant which, given a set S of points from some metric space and a parameter k, requires to identify a subset of k centers in S minimizing the maximum distance of any point of S from its closest center. A more general formulation, introduced to deal with noisy datasets, features a further parameter z and allows up to z points of S (outliers) to be disregarded when computing the maximum distance from the centers. We present coreset-based 2-round MapReduce algorithms for the above two formulations of the problem, and a 1-pass Streaming algorithm for the case with outliers. For any fixed ε > 0, the algorithms yield solutions whose approximation ratios are a mere additive term ε away from those achievable by the best known polynomial-time sequential algorithms, a result that substantially improves upon the state of the art. Our algorithms are rather simple and adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Specifically, our analysis shows that the algorithms become very space-efficient for the important case of small (constant) D. These theoretical results are complemented with a set of experiments on real-world and synthetic datasets of up to over a billion points, which show that our algorithms yield better quality solutions over the state of the art while featuring excellent scalability, and that they also lend themselves to sequential implementations much faster than existing ones.
Notes
(Composable) Coreset
a coreset is a small subset extracted from the input that admits a solution whose cost is close to that of the optimal solution for the whole input
composable → if coresets are extracted from distinct subsets of the input, their union is itself a coreset, i.e., it still admits a close-to-optimal solution for the whole input
Core Idea
based on the GMM sequential approx. alg. for k-center
GMM
first phase: find k centers in each subset
second phase: compute the solution on the coreset formed by the union of the centers
→ contribution: select more centers from each subset in the first phase
Doubling dimension
The doubling dimension of a metric space is the smallest D such that, for any radius r and any point x, all points in the ball of radius r centered at x are included in the union of at most 2^D balls of radius r/2 centered at suitable points
→ by applying the definition recursively i times, it follows that a ball of radius r can be covered by at most 2^(iD) balls of radius r/2^i
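Restating that covering argument as a display (standard reasoning, not quoted from the paper): one application of the definition, iterated i times, gives

$$
B(x, r) \subseteq \bigcup_{j=1}^{2^{D}} B\!\left(y_j, \tfrac{r}{2}\right)
\quad\Longrightarrow\quad
B(x, r) \subseteq \bigcup_{j=1}^{2^{iD}} B\!\left(y'_j, \tfrac{r}{2^{i}}\right).
$$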
GMM algorithm
sequential 2-approximation
How it works:
select an arbitrary point as the first center and add it to T. Iteratively, select as the next center the point of S farthest from the current centers in T and add it to T, until T contains k centers (see the code sketch below)
Property
Let X ⊆ S. For a given k, let T be the output of GMM when run on X. We have that the maximum distance of any point of X from its closest center in T is at most twice the optimal k-center radius of X
(GMM is a 2-approximation algorithm)
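A minimal Python sketch of the farthest-first traversal underlying GMM (my own illustration: the point representation, Euclidean distance, and tie-breaking are assumptions, not the paper's code):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as tuples of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def gmm(points, k, dist=euclidean):
    """Farthest-first traversal (GMM): pick k centers from `points`.

    Start from an arbitrary point, then repeatedly add the point farthest
    from the centers selected so far.
    """
    centers = [points[0]]                       # arbitrary first center
    d = [dist(p, centers[0]) for p in points]   # distance to closest center
    while len(centers) < k:
        idx = max(range(len(points)), key=lambda i: d[i])   # farthest point
        centers.append(points[idx])
        d = [min(d[i], dist(p, points[idx])) for i, p in enumerate(points)]
    return centers

# Example: two centers for four points in the plane.
# gmm([(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (5.0, 5.0)], k=2)
```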
k-center problem
Given a set S of points in a metric space and a positive integer k, find a subset T ⊆ S of k points (centers) so that the maximum distance of any point of S from its closest center in T is minimized.
→ outliers can heavily influence the final solution, but there exists a formulation (k-center with z outliers) that accounts for them.
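In symbols (my notation, since the note's inline formulas were lost), the k-center objective is

$$
r^{*}_{k}(S) \;=\; \min_{T \subseteq S,\ |T| = k} \ \max_{s \in S} \ \min_{t \in T} d(s, t),
$$

and in the formulation with z outliers the outer maximum is taken over S minus the z points farthest from T.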
MapReduce algorithm for k-center
First round:
- partition S into ℓ subsets S_1, ..., S_ℓ of equal size
- on each subset S_j, run GMM incrementally (past the first k centers) until r_{S_j}(T_j^i) becomes at most a small, ε-dependent fraction of r_{S_j}(T_j^k), where:
    - T_j^i is the set of the first i centers selected by GMM on S_j
    - r_{S_j}(T_j^k) is the radius of S_j with respect to the first k centers
→ the coreset extracted from S_j is the final set of selected centers; each point of S_j is mapped to its closest selected center, which acts as its proxy
Second round:
- take the union of the coresets and run GMM on it to compute the final set of k centers (a sequential sketch of the whole two-round scheme follows below)
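A sequential simulation of the two-round coreset scheme, sketched under my own assumptions: Euclidean points, an illustrative stopping fraction `eps` (the paper derives its exact threshold from the analysis), and no actual MapReduce machinery.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def gmm(points, k, dist=euclidean):
    """Plain farthest-first traversal returning k centers (as sketched above)."""
    centers = [points[0]]
    d = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=lambda i: d[i])
        centers.append(points[idx])
        d = [min(d[i], dist(p, points[idx])) for i, p in enumerate(points)]
    return centers

def gmm_incremental(points, k, eps, dist=euclidean):
    """Round-1 coreset on one subset: keep adding centers past k until the
    current radius is at most eps times the radius of the first k centers."""
    centers = [points[0]]
    d = [dist(p, centers[0]) for p in points]
    radius_at_k = None
    while True:
        radius = max(d)
        if len(centers) == k:
            radius_at_k = radius
        if len(centers) >= k and radius <= eps * radius_at_k:
            return centers
        idx = max(range(len(points)), key=lambda i: d[i])
        centers.append(points[idx])
        d = [min(d[i], dist(p, points[idx])) for i, p in enumerate(points)]

def two_round_k_center(points, k, num_partitions, eps):
    """Round 1: build a coreset from each partition; Round 2: GMM on their union."""
    size = math.ceil(len(points) / num_partitions)
    partitions = [points[i:i + size] for i in range(0, len(points), size)]
    coreset = [c for part in partitions for c in gmm_incremental(part, k, eps)]
    return gmm(coreset, k)
```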
Variant with z outliers
in the first round, we define for each point of S_j its proxy as before, but we add to each coreset point t a weight equal to the number of points of S_j whose proxy is t
in the second round, instead of using GMM we use an algorithm called OutliersCluster to solve a weighted variant of k-center with outliers on the union of the weighted coresets (a sketch of the weighted coreset construction follows below).
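A minimal sketch of the round-1 weighted coreset for the outlier variant (my own illustration; the center-selection step would be the incremental GMM from the previous sketch, and the second-round OutliersCluster procedure is not shown):

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def weighted_coreset(subset, centers, dist=euclidean):
    """Assign each point of `subset` to its proxy (closest selected center)
    and return (center, weight) pairs, the weight counting represented points."""
    weights = Counter()
    for p in subset:
        proxy = min(centers, key=lambda c: dist(p, c))
        weights[proxy] += 1
    return [(c, weights[c]) for c in centers]

# Round 2 (not shown here): the union of these weighted pairs over all subsets
# is fed to a weighted k-center-with-outliers procedure (OutliersCluster in the paper).
```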