The ratio of two densities provides a direct characterization of their differences. We consider the two-sample comparison problem by estimating this ratio given i.i.d. observations from two distributions. To this end, we propose additive tree models for density ratio estimation along with efficient algorithms using a new loss function, the balancing loss. The loss allows tree-based models to be trained using several algorithms originally designed for supervised learning, such as forward-stagewise optimization and gradient boosting. Moreover, the balancing loss resembles an exponential family kernel, and it can serve as a pseudo-likelihood with conjugate priors. This property enables generalized Bayesian inference on the density ratio using backfitting samplers designed for Bayesian additive regression trees (BART). Our Bayesian strategy provides uncertainty quantification for the inferred density ratio, which is critical for applications involving high-dimensional and data-limited distributions with potentially substantial uncertainty. We further show connections of the balancing loss to the exponential loss in binary classification and to the variational form of f-divergence, particularly the squared Hellinger distance. Numerical experiments demonstrate that our method achieves both accuracy and computational efficiency, while uniquely providing uncertainty quantification. Finally, we demonstrate its application to assessing the quality of generative models for microbiome compositional data.
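The link between the exponential loss, the density ratio, and the squared Hellinger distance can be checked numerically on a toy case. The abstract does not define the balancing loss, so the sketch below uses the standard exponential loss from binary classification instead; it is not the paper's loss or algorithm. The two discrete distributions `p` and `q` are hypothetical, and the convention H^2 = 1 - sum sqrt(p*q) is assumed for the squared Hellinger distance. Under these assumptions, the loss E_p[exp(-f)] + E_q[exp(f)] is minimized pointwise at f* = (1/2) log(p/q), and its minimum value equals 2(1 - H^2).

```python
import math

# Two hypothetical discrete distributions on the same 4-point support
# (illustrative values only, not taken from the paper).
p = [0.1, 0.2, 0.3, 0.4]
q = [0.4, 0.3, 0.2, 0.1]

def exp_loss(f):
    """Population exponential loss: E_p[exp(-f)] + E_q[exp(f)]."""
    return sum(pi * math.exp(-fi) + qi * math.exp(fi)
               for pi, qi, fi in zip(p, q, f))

# Pointwise minimizer of the exponential loss: half the log density ratio.
f_star = [0.5 * math.log(pi / qi) for pi, qi in zip(p, q)]

# The density ratio p/q is recovered from the minimizer as exp(2 * f*).
ratio = [math.exp(2.0 * fi) for fi in f_star]

# Squared Hellinger distance, assumed convention H^2 = 1 - sum sqrt(p * q).
hellinger_sq = 1.0 - sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

# Loss value at the minimizer; analytically this is 2 * (1 - H^2).
min_loss = exp_loss(f_star)

# Shifting f away from f* should only increase the loss.
shifted_loss = exp_loss([fi + 0.1 for fi in f_star])
```

In this toy setting the densities are known, so the minimizer is available in closed form. In the two-sample problem the expectations would be replaced by sample averages and f by a fitted model such as an additive tree ensemble.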