Discrete Determinantal Point Processes (DPPs) have a wide array of potential applications for subsampling datasets. In some cases, however, they are held back by the high cost of sampling. In the worst-case scenario, the sampling cost scales as O(n^3), where n is the number of elements of the ground set. A popular workaround to this prohibitive cost is to sample DPPs defined by low-rank kernels. In such cases, the cost of standard sampling algorithms scales as O(np^2 + nm^2), where m is the (average) size of a sample from the DPP (usually m ≪ n) and p is the rank of the kernel used to define the DPP (m ≤ p ≤ n). The first term, O(np^2), comes from an SVD-like step. We focus here on the second term of this cost, O(nm^2), and show that it can be brought down to O(nm + m^3 log m) without any loss of exactness in the sampling. In practice, we observe substantial speedups compared to the classical algorithm as soon as n > 1000. The algorithm described here is a close variant of the standard algorithm for sampling continuous DPPs, and uses rejection sampling. In the specific case of projection DPPs, we also show that any additional sample can be drawn in time O(m^3 log m). Finally, an interesting by-product of the analysis is that a realisation of a DPP is typically contained in a subset of size O(m log m) formed by i.i.d. leverage-score sampling.
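To make the rejection-sampling idea concrete, here is a minimal Python sketch for the projection-DPP case. It is an illustration under stated assumptions, not necessarily the exact algorithm of the paper: the kernel is assumed to be K = U U^T with U an n × m matrix with orthonormal columns, candidates are proposed i.i.d. from the leverage scores ||U_i||^2 / m, and a candidate is accepted with probability equal to the fraction of its squared norm left after projecting out the rows already selected. The function name `sample_projection_dpp_rejection` and the single-pass Gram-Schmidt update are choices made for this sketch.

```python
import numpy as np


def sample_projection_dpp_rejection(U, rng):
    """Draw one sample from the projection DPP with kernel K = U @ U.T.

    Assumes U has shape (n, m) with orthonormal columns (illustrative setup).
    """
    n, m = U.shape
    lev = np.sum(U**2, axis=1)        # leverage scores; they sum to m, cost O(nm)
    cdf = np.cumsum(lev)
    cdf /= cdf[-1]                    # CDF of the i.i.d. proposal distribution

    B = np.zeros((m, m))              # rows 0..k-1: orthonormal basis of selected rows
    sample, chosen = [], set()
    while len(sample) < m:
        k = len(sample)
        # Propose an item with probability proportional to its leverage score.
        i = int(np.searchsorted(cdf, rng.random(), side="right"))
        if i in chosen:
            continue                  # exact: a selected item has zero residual
        u = U[i]
        r = u - B[:k].T @ (B[:k] @ u)  # part of u orthogonal to the selected rows
        # Accept with probability ||r||^2 / ||u||^2 (always <= 1).
        if rng.random() * lev[i] < r @ r:
            sample.append(i)
            chosen.add(i)
            B[k] = r / np.linalg.norm(r)
    return sample


# Usage: build an orthonormal U from a random matrix and draw one sample.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((500, 8)))
print(sorted(sample_projection_dpp_rejection(U, rng)))
```

In this sketch the only work that touches all n items is the leverage-score computation, which costs O(nm). Each proposal then costs O(m^2) at most, and the expected number of proposals at step k is m / (m - k), which sums to about m log m over the m steps. That is one plausible way to arrive at a cost of the form O(nm + m^3 log m).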