We study the fundamental problem of clustering $n$ points drawn from a mixture of $K$ isotropic Gaussians in $\mathbb{R}^d$ into $K$ groups. Specifically, we investigate the minimal distance $\Delta$ between mean vectors required to partially recover the underlying partition. While the minimax-optimal threshold for $\Delta$ is well established, a significant gap exists between this information-theoretic limit and the performance of known polynomial-time procedures. Although this gap was recently characterized in the high-dimensional regime ($n \leq dK$), it remains largely unexplored in the moderate-dimensional regime ($n \geq dK$). In this manuscript, we address the latter regime by establishing a new low-degree polynomial lower bound for the moderate-dimensional case when $d \geq K$. We show that while the difficulty of clustering for $n \leq dK$ is primarily driven by dimension reduction and spectral methods, the moderate-dimensional regime involves more delicate phenomena leading to a "non-parametric rate". We provide a novel non-spectral algorithm matching this rate, shedding new light on the computational limits of the clustering problem in moderate dimension.