Spectral Clustering in Birthday Paradox Time

Given a vertex in a $(k, \varphi, \epsilon)$-clusterable graph, i.e. a graph whose vertex set can be partitioned into a disjoint union of $\varphi$-expanders of size $\approx n/k$ with outer conductance bounded by $\epsilon$, can one quickly tell which cluster it belongs to? This question goes back to the expansion testing problem of Goldreich and Ron'11. For $k=2$, a sample of $\approx n^{1/2+O(\epsilon/\varphi^2)}$ logarithmic-length walks from a given vertex approximately determines its cluster membership by the birthday paradox: two vertices whose random walk samples are "close" are likely in the same cluster. The study of the general case $k>2$ was initiated by Czumaj, Peng and Sohler [STOC'15], and the works of Chiplunkar et al. [FOCS'18] and Gluch et al. [SODA'21] showed that $\approx \text{poly}(k)\cdot n^{1/2+O(\epsilon/\varphi^2)}$ random walk samples suffice for general $k$. This matches the $k=2$ result up to polynomial factors in $k$, but creates a conceptual inconsistency: if the birthday paradox is the guiding phenomenon, then the query complexity should decrease with the number of clusters $k$. Since clusters have size $\approx n/k$, we expect to need only $\approx (n/k)^{1/2+O(\epsilon/\varphi^2)}$ random walk samples. We design a novel representation of vertices in a $(k, \varphi, \epsilon)$-clusterable graph by a mixture of logarithmic-length walks. This representation uses the optimal $\approx (n/k)^{1/2+O(\epsilon/\varphi^2)}$ walks per vertex and allows for fast nearest neighbor search: given $k$ vertices representing the clusters, we can find the cluster of a given query vertex $x$ in time nearly linear in the representation size of $x$. This gives a clustering oracle with query time $\approx (n/k)^{1/2+O(\epsilon/\varphi^2)}$ and space complexity $k\cdot (n/k)^{1/2+O(\epsilon/\varphi^2)}$, matching the birthday paradox bound.
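
The birthday-paradox intuition can be illustrated with a small sketch. This is not the representation designed in this work; it is only the basic collision test the abstract alludes to, run on a toy graph. The graph model (cliques joined by a few random cross edges), the function names, and all parameter values are assumptions made for illustration. With $s$ walks per vertex, two vertices in the same cluster of size $m \approx n/k$ produce about $s^2/m$ endpoint collisions, so collisions start to appear once $s \approx \sqrt{m}$.

```python
import random
from collections import Counter


def make_clustered_graph(k, size, inter_edges, rng):
    """Toy clusterable graph (an assumption for illustration):
    k cliques of `size` vertices plus a few random cross-cluster edges."""
    n = k * size
    adj = [[] for _ in range(n)]
    for c in range(k):
        block = range(c * size, (c + 1) * size)
        for u in block:
            adj[u].extend(v for v in block if v != u)
    for _ in range(inter_edges):
        u, v = rng.randrange(n), rng.randrange(n)
        if u // size != v // size:  # keep only edges between different cliques
            adj[u].append(v)
            adj[v].append(u)
    return adj


def walk_endpoints(adj, start, num_walks, length, rng):
    """Run `num_walks` independent random walks of `length` steps from
    `start` and count how often each vertex is the endpoint."""
    ends = Counter()
    for _ in range(num_walks):
        v = start
        for _ in range(length):
            v = rng.choice(adj[v])
        ends[v] += 1
    return ends


def collision_score(ends_a, ends_b, num_walks):
    """Fraction of (walk from a, walk from b) pairs ending at the same vertex.
    This estimates the dot product of the two endpoint distributions."""
    hits = sum(count * ends_b[v] for v, count in ends_a.items())
    return hits / (num_walks * num_walks)


if __name__ == "__main__":
    rng = random.Random(0)
    adj = make_clustered_graph(k=4, size=50, inter_edges=20, rng=rng)
    # Far more walks than sqrt(50) are used here so the toy output is stable.
    a = walk_endpoints(adj, 0, 200, 8, rng)    # vertex in cluster 0
    b = walk_endpoints(adj, 1, 200, 8, rng)    # another vertex in cluster 0
    c = walk_endpoints(adj, 60, 200, 8, rng)   # vertex in cluster 1
    print("same cluster:     ", collision_score(a, b, 200))
    print("different cluster:", collision_score(a, c, 200))
```

In this toy setting the same-cluster score should sit near $1/m$ while the cross-cluster score stays close to zero, which is the separation a collision-based test relies on.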
