In the kernel density estimation (KDE) problem one is given a kernel $K(x, y)$ and a dataset $P$ of points in a Euclidean space, and must prepare a data structure that can quickly answer density queries: given a point $q$, output a $(1+\epsilon)$-approximation to $\mu:=\frac1{|P|}\sum_{p\in P} K(p, q)$. The classical approach to KDE is the celebrated fast multipole method of [Greengard and Rokhlin]. The fast multipole method combines a basic space partitioning approach with a multidimensional Taylor expansion, which yields a query time of $\approx \log^d (n/\epsilon)$, exponential in the dimension $d$. A recent line of work initiated by [Charikar and Siminelakis] achieved polynomial dependence on $d$ via a combination of random sampling and randomized space partitioning, with [Backurs et al.] giving an efficient data structure with query time $\approx \mathrm{poly}(\log(1/\mu))/\epsilon^2$ for smooth kernels. The quadratic dependence on $1/\epsilon$, inherent to sampling methods, is prohibitively expensive for small $\epsilon$. In numerical analysis, this issue is addressed by quasi-Monte Carlo methods. The high-level idea of quasi-Monte Carlo methods is to replace random sampling with a discrepancy-based approach, an idea recently applied to coresets for KDE by [Phillips and Tai]. The work of Phillips and Tai gives a space-efficient data structure with query complexity $\approx 1/(\epsilon \mu)$. This is polynomially better in $1/\epsilon$, but exponentially worse in $1/\mu$. We achieve the best of both: a data structure with $\approx \mathrm{poly}(\log(1/\mu))/\epsilon$ query time for smooth kernel KDE. Our main insight is a new way to combine discrepancy theory with randomized space partitioning, inspired by, but significantly more efficient than, that of the fast multipole method. We hope that our techniques will find further applications to linear algebra for kernel matrices.
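To make the query concrete, the following is a minimal sketch of the quantity $\mu$ defined above, computed exactly by a linear scan and estimated by plain uniform sampling. It is not the data structure of this work or of any cited work. The Gaussian kernel, the function names `gaussian_kernel`, `exact_kde`, and `sampled_kde`, and the dataset and sample sizes are all assumptions chosen for illustration.

```python
import math
import random


def gaussian_kernel(p, q):
    """K(p, q) = exp(-||p - q||^2); an example kernel assumed for illustration."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)))


def exact_kde(points, q, kernel=gaussian_kernel):
    """mu = (1/|P|) * sum over p in P of K(p, q), via a full scan of P."""
    return sum(kernel(p, q) for p in points) / len(points)


def sampled_kde(points, q, m, rng, kernel=gaussian_kernel):
    """Unbiased estimate of mu from m points drawn uniformly with replacement."""
    return sum(kernel(rng.choice(points), q) for _ in range(m)) / m


# Hypothetical data: 2000 standard Gaussian points in the plane, query at the origin.
rng = random.Random(0)
P = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
q = (0.0, 0.0)

mu = exact_kde(P, q)
est = sampled_kde(P, q, m=4000, rng=rng)
print(f"exact mu = {mu:.4f}, sampled estimate = {est:.4f}")
```

The sampling error of such an estimator shrinks only as the inverse square root of the number of samples, which is where the $1/\epsilon^2$ dependence of sampling-based methods comes from.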