Score-debiased kernel density estimation (SD-KDE) achieves improved asymptotic convergence rates over classical KDE, but its reliance on an empirical score makes it significantly slower in practice. We show that by re-ordering the SD-KDE computation to expose matrix-multiplication structure, Tensor Cores can be used to accelerate the GPU implementation. On a 32k-sample 16-dimensional problem, our approach, Flash-SD-KDE, runs up to $47\times$ faster than a strong SD-KDE GPU baseline and $3{,}300\times$ faster than scikit-learn's KDE. On a larger 1M-sample 16-dimensional task evaluated on 131k queries, Flash-SD-KDE completes in $2.3$ s on a single GPU, making score-debiased density estimation practical at previously infeasible scales.
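The re-ordering the abstract alludes to can be illustrated with the standard identity $\|q - x\|^2 = \|q\|^2 + \|x\|^2 - 2\,q^\top x$, which turns the pairwise-distance computation inside Gaussian KDE into a single dense matrix multiply (the part a GPU's Tensor Cores can accelerate). The sketch below is a generic NumPy illustration of that trick for plain Gaussian KDE, not the paper's actual Flash-SD-KDE kernel; the function name and the absence of the score-debiasing step are simplifications for exposition.

```python
import numpy as np

def kde_gauss_matmul(X, Q, h):
    """Gaussian KDE at query points Q, restructured so that all
    pairwise squared distances come from one matrix multiply.

    Generic sketch of the matmul re-ordering; the debiasing step
    of SD-KDE is omitted for brevity.
    X: (n, d) samples, Q: (m, d) queries, h: bandwidth.
    """
    n, d = X.shape
    # ||q - x||^2 = ||q||^2 + ||x||^2 - 2 q.x ; the -2 Q X^T term
    # is the dense matmul that maps onto Tensor Cores on a GPU.
    sq = (Q**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * Q @ X.T
    sq = np.maximum(sq, 0.0)  # clamp tiny negatives from rounding
    K = np.exp(-sq / (2.0 * h**2))
    # Normalized Gaussian-kernel density estimate at each query.
    return K.sum(1) / (n * (np.sqrt(2.0 * np.pi) * h) ** d)
```

The payoff of this formulation is that the $O(nm d)$ distance computation becomes a single GEMM call, which GPU libraries execute at near-peak throughput, instead of $nm$ scalar distance loops.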