We study graph clustering in the Stochastic Block Model (SBM) in the presence of both large clusters and small, unrecoverable clusters. Previous convex relaxation approaches achieving exact recovery do not allow any small clusters of size $o(\sqrt{n})$, or require a size gap between the smallest recovered cluster and the largest non-recovered cluster. We provide an algorithm based on semidefinite programming (SDP) which removes these requirements and provably recovers large clusters regardless of the remaining cluster sizes. Mid-sized clusters pose unique challenges to the analysis, since their proximity to the recovery threshold makes them highly sensitive to small noise perturbations and precludes a closed-form candidate solution. We develop novel techniques, including a leave-one-out-style argument which controls the correlation between SDP solutions and noise vectors even when the removal of one row of noise can drastically change the SDP solution. We also develop improved eigenvalue perturbation bounds of potential independent interest. Our results are robust to certain semirandom settings that are challenging for alternative algorithms. Using our gap-free clustering procedure, we obtain efficient algorithms for the problem of clustering with a faulty oracle with superior query complexities, notably achieving $o(n^2)$ sample complexity even in the presence of a large number of small clusters. Our gap-free clustering procedure also leads to improved algorithms for recursive clustering.