Estimating the ratio of two probability densities from finitely many samples drawn from each is a central problem in machine learning and statistics, with applications in two-sample testing, divergence estimation, generative modeling, covariate shift adaptation, conditional density estimation, and novelty detection. In this work, we analyze a large class of density ratio estimation methods that minimize a regularized Bregman divergence between the true density ratio and a model in a reproducing kernel Hilbert space (RKHS). We derive new finite-sample error bounds, and we propose a Lepskii-type parameter choice principle that minimizes these bounds without knowledge of the regularity of the density ratio. In the special case of quadratic loss, our method adaptively achieves a minimax optimal error rate. A numerical illustration is provided.
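As a rough illustration of the quadratic-loss special case, the sketch below fits a kernel least-squares density ratio estimator in the style of uLSIF. It is not the method analyzed in this work: it assumes a Gaussian kernel, a finite set of kernel centers drawn from the numerator sample, a ridge penalty on the coefficients instead of the RKHS norm, and a fixed regularization parameter `lam` in place of the Lepskii-type choice. The names `gaussian_kernel` and `fit_density_ratio`, and all parameter values, are hypothetical.

```python
import numpy as np


def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel matrix between x (n, d) and centers (m, d); returns (n, m)."""
    sq_dist = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def fit_density_ratio(x_num, x_den, sigma=1.0, lam=1e-2, n_centers=100, seed=0):
    """Estimate r = p / q from samples x_num ~ p and x_den ~ q (assumed setup).

    Minimizes the empirical quadratic criterion
        0.5 * mean_q[r(x)^2] - mean_p[r(x)] + 0.5 * lam * ||alpha||^2
    over r(x) = sum_l alpha_l * k(x, c_l), which has a closed-form solution.
    """
    rng = np.random.default_rng(seed)
    m = min(n_centers, len(x_num))
    # Kernel centers are a random subset of the numerator sample (an assumption).
    centers = x_num[rng.choice(len(x_num), size=m, replace=False)]

    phi_num = gaussian_kernel(x_num, centers, sigma)
    phi_den = gaussian_kernel(x_den, centers, sigma)

    H = phi_den.T @ phi_den / len(x_den)  # empirical second moment under q
    h = phi_num.mean(axis=0)              # empirical first moment under p
    alpha = np.linalg.solve(H + lam * np.eye(m), h)

    def ratio(x):
        return gaussian_kernel(x, centers, sigma) @ alpha

    return ratio


# Toy data: p = N(0, 1) and q = N(0.5, 1) in one dimension.
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(500, 1))
x_q = rng.normal(0.5, 1.0, size=(500, 1))

ratio = fit_density_ratio(x_p, x_q, sigma=1.0, lam=1e-2)
grid = np.array([[-1.0], [0.0], [1.0]])
estimates = ratio(grid)
print(estimates)
```

In this sketch `lam` trades off fit against smoothness and is set by hand. A data-driven rule, such as the Lepskii-type principle proposed here, would instead compare estimators across a grid of regularization values.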