We present a new multi-layer peeling technique for clustering points in a metric space. A well-known non-parametric objective is to embed the metric space into a simpler structured metric space such as a line (i.e., Linear Arrangement) or a binary tree (i.e., Hierarchical Clustering). Points that are close in the metric space should be mapped to close points on the line or close leaves of the tree; similarly, points that are far apart in the metric space should be far apart on the line or in the tree. In particular, we consider the Maximum Linear Arrangement problem \cite{Approximation_algorithms_for_maximum_linear_arrangement} and the Maximum Hierarchical Clustering problem \cite{Hierarchical_Clustering:_Objective_Functions_and_Algorithms} applied to metrics. We design approximation schemes (a $(1 - \epsilon)$-approximation for any constant $\epsilon > 0$) for these objectives. This shows that restricting the input to metrics allows a significant improvement over the previous approximation ratios ($0.5$ for Max Linear Arrangement and $0.74$ for Max Hierarchical Clustering). Our main technique, multi-layer peeling, consists of recursively peeling off points that are far from the "core" of the metric space. The recursion ends once the core becomes a sufficiently densely weighted metric space (i.e., the average distance is at least a constant times the diameter) or once its inner contribution to the objective becomes negligible. Interestingly, the algorithm for the Linear Arrangement case is much more involved than that for the Hierarchical Clustering case and uses a significantly more delicate peeling.
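The stopping rule described above (the core is "dense" once its average distance is at least a constant times its diameter) can be illustrated with a small sketch. This is not the algorithm of the paper, whose peeling is considerably more delicate. It is a simplified illustration under stated assumptions: the names `is_dense` and `peel_layers`, the constant `c`, the rule "peel the point with the largest total distance to the rest", and the `min_core` cutoff (a crude stand-in for the "negligible contribution" condition) are all hypothetical choices made for this example.

```python
from itertools import combinations


def diameter(points, dist):
    """Largest pairwise distance (0.0 for fewer than two points)."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def average_distance(points, dist):
    """Mean pairwise distance (0.0 for fewer than two points)."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def is_dense(points, dist, c):
    """Density test from the text: average distance >= c * diameter."""
    return average_distance(points, dist) >= c * diameter(points, dist)


def peel_layers(points, dist, c=0.25, min_core=2):
    """Illustrative peeling loop (an assumption, not the paper's procedure).

    Repeatedly removes the point with the largest total distance to the
    current core until the core is dense or has at most `min_core` points.
    Returns the peeled points in removal order and the remaining core.
    """
    core = list(points)
    peeled = []
    while len(core) > min_core and not is_dense(core, dist, c):
        farthest = max(core, key=lambda p: sum(dist(p, q) for q in core))
        core.remove(farthest)
        peeled.append(farthest)
    return peeled, core


# Usage: ten nearby points on a line plus one distant outlier.
line_metric = lambda a, b: abs(a - b)
peeled, core = peel_layers(list(range(10)) + [1000], line_metric)
```

In this example the outlier is peeled first, after which the remaining ten points already satisfy the density test with `c = 0.25`, so the loop stops.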