The $(k, z)$-Clustering problem in Euclidean space $\mathbb{R}^d$ has been extensively studied. Given the scale of the data involved, compression methods for Euclidean $(k, z)$-Clustering, such as data compression and dimension reduction, have received significant attention in the literature. However, the space complexity of the clustering problem, that is, the number of bits required to compress the cost function to within a multiplicative error of $\varepsilon$, remains unclear. This paper initiates the study of space complexity for Euclidean $(k, z)$-Clustering and provides both upper and lower bounds. Our space bounds are nearly tight when $k$ is constant, indicating that storing a coreset, a well-known data compression approach, is an optimal compression scheme in that regime. Furthermore, our lower bound for $(k, z)$-Clustering establishes a tight space bound of $\Theta(nd)$ for terminal embedding, where $n$ is the dataset size. Our techniques rely on new geometric insights into principal angles and on discrepancy methods, which may be of independent interest.
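For concreteness, the cost function being compressed can be illustrated as follows. The abstract does not spell out the definition, so this sketch assumes the standard formulation $\mathrm{cost}_z(P, C) = \sum_{p \in P} \min_{c \in C} \|p - c\|^z$ for a dataset $P$ and a set $C$ of $k$ centers. The function names `clustering_cost`, `weighted_cost`, and `within_relative_error` are illustrative and do not come from the paper; the weighted summary is a generic stand-in for a compressed representation, not the paper's construction.

```python
import math


def clustering_cost(points, centers, z=2):
    """Assumed standard (k, z) cost: sum over points of
    (Euclidean distance to the nearest center) ** z."""
    return sum(min(math.dist(p, c) for c in centers) ** z for p in points)


def weighted_cost(weighted_points, centers, z=2):
    """Same cost evaluated on a weighted summary given as (point, weight) pairs."""
    return sum(
        w * min(math.dist(p, c) for c in centers) ** z
        for p, w in weighted_points
    )


def within_relative_error(estimate, exact, eps):
    """Check the multiplicative guarantee |estimate - exact| <= eps * exact."""
    return abs(estimate - exact) <= eps * exact


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 0.0), (4.0, 0.0)]
    # Merging duplicate points into weighted points preserves the cost exactly.
    summary = [((0.0, 0.0), 2.0), ((4.0, 0.0), 1.0)]
    centers = [(1.0, 0.0)]

    exact = clustering_cost(points, centers, z=2)
    approx = weighted_cost(summary, centers, z=2)
    print(exact, approx, within_relative_error(approx, exact, eps=0.1))
```

In this toy case the summary reproduces the cost exactly for every choice of centers; the space complexity question asks how few bits any such representation can use while still satisfying the relative-error check for all center sets.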