In this work we consider the problem of numerical integration, i.e., approximating integrals with respect to a target probability measure using only pointwise evaluations of the integrand. We focus on the setting in which the target distribution is only accessible through a set of $n$ i.i.d. observations, and the integrand belongs to a reproducing kernel Hilbert space (RKHS). We propose an efficient procedure that exploits a small i.i.d. random subset of $m<n$ samples drawn from the initial observations, either uniformly or using approximate leverage scores. Our main result is an upper bound on the approximation error of this procedure for both sampling strategies. It yields sufficient conditions on the subsample size to recover the standard (optimal) $n^{-1/2}$ rate while drastically reducing the number of function evaluations, and thus the overall computational cost. Moreover, we obtain rates with respect to the number $m$ of evaluations of the integrand which adapt to its smoothness and match known optimal rates, for instance for Sobolev spaces. We illustrate our theoretical findings with numerical experiments on real datasets, which highlight the attractive efficiency-accuracy tradeoff of our method compared to existing randomized and greedy quadrature methods. We note that the problem of numerical integration in an RKHS amounts to designing a discrete approximation of the kernel mean embedding of the target distribution. As a consequence, direct applications of our results also include the efficient computation of maximum mean discrepancies between distributions and the design of efficient kernel-based tests.
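To make the setting concrete, the following is a minimal sketch of a subsampled kernel quadrature rule with uniform sampling. The abstract does not state how the quadrature weights are computed, so the weight formula here is an assumption: the weights are chosen by regularized least squares so that the weighted sum of kernel functions at the $m$ landmarks approximates the empirical kernel mean embedding of the $n$ observations. The Gaussian kernel, the function names `gaussian_kernel` and `subsampled_quadrature`, the regularization parameter `reg`, and sampling without replacement are all illustrative choices, and the leverage-score variant is not shown.

```python
import numpy as np


def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def subsampled_quadrature(X, m, reg=1e-8, bandwidth=1.0, seed=None):
    """Return landmark indices and weights for a subsampled quadrature rule.

    Assumption (not taken from the abstract): the weights solve
        (K_mm + reg * m * I) w = K_mn 1_n / n,
    a regularized projection of the empirical kernel mean embedding onto
    the span of the kernel functions at the m landmarks.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Uniform subsample; drawn without replacement here as a simplification.
    idx = rng.choice(n, size=m, replace=False)
    landmarks = X[idx]
    K_mm = gaussian_kernel(landmarks, landmarks, bandwidth)
    K_mn = gaussian_kernel(landmarks, X, bandwidth)
    # Empirical kernel mean embedding evaluated at the landmarks.
    z = K_mn.mean(axis=1)
    weights = np.linalg.solve(K_mm + reg * m * np.eye(m), z)
    return idx, weights


def integrate(f, X, idx, weights):
    """Approximate the integral of f using only m evaluations of f."""
    return float(weights @ f(X[idx]))
```

With `idx, w = subsampled_quadrature(X, m)`, the call `integrate(f, X, idx, w)` evaluates `f` at only the `m` landmarks, whereas the plain empirical average `f(X).mean()` requires all `n` evaluations. Computing the weights still uses the kernel on all `n` points, but not the integrand.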