Quantum (Inspired) $D^2$-sampling with Applications

$D^2$-sampling is a fundamental component of sampling-based clustering algorithms such as $k$-means++. Given a dataset $V \subset \mathbb{R}^d$ with $N$ points and a center set $C \subset \mathbb{R}^d$, $D^2$-sampling refers to picking a point from $V$ where the sampling probability of a point is proportional to its squared distance from the nearest center in $C$. Starting with an empty $C$ and iteratively $D^2$-sampling and updating $C$ for $k$ rounds is precisely $k$-means++ seeding, which runs in $O(Nkd)$ time and gives an $O(\log{k})$-approximation in expectation for the $k$-means problem. We give a quantum algorithm for (approximate) $D^2$-sampling in the QRAM model that results in a quantum implementation of $k$-means++ that runs in time $\tilde{O}(\zeta^2 k^2)$. Here $\zeta$ is the aspect ratio (i.e., the ratio of the largest to the smallest interpoint distance), and $\tilde{O}$ hides polylogarithmic factors in $N, d, k$. It can be shown through a robust approximation analysis of $k$-means++ that the quantum version preserves its $O(\log{k})$ approximation guarantee. Further, we show that our quantum algorithm for $D^2$-sampling can be 'dequantized' using the sample-query access model of Tang (PhD Thesis, Ewin Tang, University of Washington, 2023). This results in a fast quantum-inspired classical implementation of $k$-means++, which we call QI-$k$-means++, with a running time of $O(Nd) + \tilde{O}(\zeta^2k^2d)$, where the $O(Nd)$ term is for setting up the sample-query access data structure. Experimental investigations show promising results for QI-$k$-means++ on large datasets with bounded aspect ratio. Finally, we use our quantum $D^2$-sampling with the known $D^2$-sampling-based classical approximation scheme (i.e., $(1+\varepsilon)$-approximation for any given $\varepsilon>0$) to obtain the first quantum approximation scheme for the $k$-means problem with polylogarithmic running time dependence on $N$.
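For reference, the classical baseline described above can be sketched in a few lines. This is a minimal illustration of standard $O(Nkd)$ $k$-means++ seeding and of the aspect ratio $\zeta$, not the quantum or QI-$k$-means++ algorithm. The function names (`sq_dist`, `d2_sample`, `kmeanspp_seeding`, `aspect_ratio`) are hypothetical, and the sketch assumes the usual convention that the first center is drawn uniformly at random, since every point is equally far from an empty $C$.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sample(min_sq_dists, rng):
    """Return an index drawn with probability proportional to min_sq_dists.

    min_sq_dists[i] is the squared distance from point i to its nearest
    center. If every weight is zero, fall back to a uniform draw.
    """
    if sum(min_sq_dists) == 0:
        return rng.randrange(len(min_sq_dists))
    return rng.choices(range(len(min_sq_dists)), weights=min_sq_dists, k=1)[0]


def kmeanspp_seeding(points, k, seed=0):
    """Classical k-means++ seeding: k rounds of D^2-sampling, O(Nkd) time."""
    rng = random.Random(seed)
    # Assumption: with C empty, the first center is chosen uniformly.
    centers = [points[rng.randrange(len(points))]]
    min_sq = [sq_dist(p, centers[0]) for p in points]
    for _ in range(k - 1):
        c = points[d2_sample(min_sq, rng)]
        centers.append(c)
        # O(Nd) update: each point keeps its distance to the nearest center.
        min_sq = [min(m, sq_dist(p, c)) for m, p in zip(min_sq, points)]
    return centers


def aspect_ratio(points):
    """Largest over smallest interpoint distance (brute force, O(N^2 d)).

    Assumes at least two points and that all points are distinct.
    """
    dists = [math.sqrt(sq_dist(p, q))
             for i, p in enumerate(points) for q in points[i + 1:]]
    return max(dists) / min(dists)
```

For example, `kmeanspp_seeding(points, 3)` on a list of distinct coordinate tuples returns three distinct points of the dataset, because a point that is already a center has sampling weight zero. The per-round $O(Nd)$ distance update is the cost that the abstract's quantum and quantum-inspired variants aim to avoid.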
