Estimating the ratio of two probability densities from finitely many observations drawn from each density is a central problem in machine learning and statistics, with applications in two-sample testing, divergence estimation, generative modeling, covariate shift adaptation, conditional density estimation, and novelty detection. In this work, we analyze a large class of density ratio estimation methods that minimize a regularized Bregman divergence between the true density ratio and a model in a reproducing kernel Hilbert space (RKHS). We derive new finite-sample error bounds, and we propose a Lepskii-type parameter choice principle that minimizes the bounds without knowledge of the regularity of the density ratio. In the special case of quadratic loss, our method adaptively achieves a minimax-optimal error rate. A numerical illustration is provided.
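To make the quadratic-loss special case concrete, the following is a minimal sketch of a kernel least-squares density ratio estimator in the style of uLSIF. It is not the estimator analyzed in this work: it assumes a Gaussian kernel, a finite set of kernel centers drawn from the numerator sample, and a plain ridge penalty on the coefficients in place of the RKHS norm. The function names, the bandwidth `sigma`, and the regularization parameter `lam` are illustrative choices, and `lam` is fixed by hand here rather than selected by a Lepskii-type principle.

```python
import numpy as np


def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel matrix between rows of x (n, d) and centers (m, d)."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def fit_ratio_quadratic(x_p, x_q, sigma=1.0, lam=1e-3, n_centers=100, seed=0):
    """Estimate r = p/q from samples x_p ~ p and x_q ~ q.

    Minimizes the empirical quadratic objective
        0.5 * mean_q[r(x)^2] - mean_p[r(x)] + 0.5 * lam * ||alpha||^2
    over r(x) = sum_l alpha_l * k(x, c_l), which has a closed-form solution.
    """
    rng = np.random.default_rng(seed)
    m = min(n_centers, len(x_p))
    centers = x_p[rng.choice(len(x_p), size=m, replace=False)]

    phi_p = gaussian_kernel(x_p, centers, sigma)  # features on numerator sample
    phi_q = gaussian_kernel(x_q, centers, sigma)  # features on denominator sample

    H = phi_q.T @ phi_q / len(x_q)  # empirical second moment under q
    h = phi_p.mean(axis=0)          # empirical first moment under p
    alpha = np.linalg.solve(H + lam * np.eye(m), h)

    def ratio(x):
        return gaussian_kernel(x, centers, sigma) @ alpha

    return ratio


# Toy data (an assumption for illustration): p = N(0, 1), q = N(0.5, 1.2^2).
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(2000, 1))
x_q = rng.normal(0.5, 1.2, size=(2000, 1))

ratio = fit_ratio_quadratic(x_p, x_q, sigma=0.5, lam=1e-2)
grid = np.array([[0.0], [3.0]])
est = ratio(grid)             # estimated ratio at x = 0 and x = 3
mean_q = ratio(x_q).mean()    # should be close to 1, since E_q[p/q] = 1
```

Because a density ratio integrates to one under the denominator distribution, checking that `mean_q` is close to 1 is a cheap sanity test of the fit. In practice the estimate is sensitive to `lam`, which is the parameter an adaptive selection rule would be used to choose.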