The $(k, z)$-Clustering problem in Euclidean space $\mathbb{R}^d$ has been extensively studied. Given the scale of the data involved, compression methods for Euclidean $(k, z)$-Clustering, such as data compression and dimension reduction, have received significant attention. However, the space complexity of the clustering problem, that is, the number of bits required to compress the cost function to within a multiplicative error $\varepsilon$, has remained unclear. This paper initiates the study of space complexity for Euclidean $(k, z)$-Clustering and provides both upper and lower bounds. Our space bounds are nearly tight when $k$ is constant, indicating that storing a coreset, a well-known data compression approach, is an optimal compression scheme in this regime. Furthermore, our lower bound for $(k, z)$-Clustering establishes a tight space bound of $\Theta(nd)$ for terminal embedding, where $n$ is the dataset size. Our technical approach leverages new geometric insights for principal angles and discrepancy methods, which may be of independent interest.