Subspace clustering becomes inherently difficult near intersections, where points from different subspaces are barely separated. Most existing theoretical results address this issue by imposing separation or sampling assumptions that limit the statistical effect of points near the intersection. We study a minimal setting of two intersecting lines in which the latent sampling law places polynomially large mass in small neighborhoods of the intersection. We derive information-theoretic lower bounds for exact and almost exact recovery under Gaussian noise. In particular, we show that the exact-recovery threshold is determined by the rate at which the latent law concentrates near the intersection. Since any two points are collinear, pairwise information alone does not reveal whether they are sampled from the same latent line. We therefore construct a hypergraph in which nearly collinear triples form hyperedges, and study the resulting hypergraph similarity matrix. Under a simple regularity condition on the latent distribution, we introduce a spectral algorithm that achieves the information-theoretic bounds up to polylogarithmic factors.
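The triple-based construction described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' algorithm: the latent law with P(|t| <= r) = r**alpha, the triangle-height collinearity test with threshold `tau`, the pair-count similarity matrix, and the sign split on the second eigenvector of the degree-normalized matrix are all assumptions chosen for illustration, as are the function names `sample_two_lines`, `triple_similarity`, and `spectral_labels`.

```python
import numpy as np


def sample_two_lines(n, angle=np.pi / 3, alpha=0.5, sigma=0.01, seed=0):
    """Noisy points on two lines through the origin in the plane.

    Assumed latent law (illustrative only): the position t along a line
    satisfies P(|t| <= r) = r**alpha on [0, 1], so a smaller alpha puts
    more mass near the intersection.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    t = rng.choice([-1.0, 1.0], size=n) * rng.uniform(size=n) ** (1.0 / alpha)
    dirs = np.array([[1.0, 0.0], [np.cos(angle), np.sin(angle)]])
    points = t[:, None] * dirs[labels] + sigma * rng.standard_normal((n, 2))
    return points, labels


def triple_similarity(points, tau):
    """A[i, j] = number of nearly collinear triples {i, j, k}.

    A triple counts as a hyperedge when the height of its triangle over
    the longest side is at most tau (an assumed collinearity test).
    """
    n = len(points)
    d = points[None, :, :] - points[:, None, :]  # d[i, j] = x_j - x_i
    dist = np.linalg.norm(d, axis=2)
    # Twice the triangle area: |(x_j - x_i) x (x_k - x_i)|, shape (n, n, n).
    cross = np.abs(d[:, :, None, 0] * d[:, None, :, 1]
                   - d[:, :, None, 1] * d[:, None, :, 0])
    longest = np.maximum(np.maximum(dist[:, :, None], dist[:, None, :]),
                         dist[None, :, :])
    height = cross / np.maximum(longest, 1e-12)
    off = ~np.eye(n, dtype=bool)
    distinct = off[:, :, None] & off[:, None, :] & off[None, :, :]
    hyper = distinct & (height <= tau)
    A = hyper.sum(axis=2).astype(float)
    # Symmetrize to absorb floating-point ties exactly at the threshold.
    return np.maximum(A, A.T)


def spectral_labels(A):
    """Two-way split by the sign of the second eigenvector of D^-1/2 A D^-1/2."""
    deg = A.sum(axis=1)
    scale = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    M = scale[:, None] * A * scale[None, :]
    _, vecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return (vecs[:, -2] > 0).astype(int)


if __name__ == "__main__":
    points, labels = sample_two_lines(60)
    A = triple_similarity(points, tau=0.03)
    est = spectral_labels(A)
    # Cluster labels are only identifiable up to swapping the two lines.
    acc = max(np.mean(est == labels), np.mean(est != labels))
    print(f"agreement with latent labels: {acc:.2f}")
```

The dense `(n, n, n)` arrays keep the sketch short but limit it to a few hundred points. Lowering `alpha` or raising `sigma` moves more points into the region where triples from different lines pass the collinearity test, which is the regime the abstract identifies as hard.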