The ratio of two densities provides a direct characterization of their differences. We consider the two-sample comparison problem by estimating this ratio given i.i.d. observations from two distributions. To this end, we propose additive tree models for density ratio estimation along with efficient algorithms using a new loss function, the balancing loss. The loss allows tree-based models to be trained using several algorithms originally designed for supervised learning, such as forward-stagewise optimization and gradient boosting. Moreover, the balancing loss resembles an exponential family kernel, and it can serve as a pseudo-likelihood with conjugate priors. This property enables generalized Bayesian inference on the density ratio using backfitting samplers designed for Bayesian additive regression trees (BART). Our Bayesian strategy provides uncertainty quantification for the inferred density ratio, which is critical for applications involving high-dimensional and data-limited distributions with potentially substantial uncertainty. We further show connections of the balancing loss to the exponential loss in binary classification and to the variational form of f-divergence, particularly the squared Hellinger distance. Numerical experiments demonstrate that our method achieves both accuracy and computational efficiency, while uniquely providing uncertainty quantification. Finally, we demonstrate its application to assessing the quality of generative models for microbiome compositional data.
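The stated link to the exponential loss and the squared Hellinger distance can be illustrated numerically. The abstract does not give the form of the balancing loss, so the sketch below assumes an exponential-type loss, the mean of exp(-g/2) over the sample from P plus the mean of exp(g/2) over the sample from Q, where g is a candidate log density ratio. This is an illustrative assumption, not necessarily the authors' definition. Under it, the population minimizer is g = log(p/q), and the minimum equals 2∫√(pq) = 2(1 − H²), using the convention H² = 1 − ∫√(pq). The two Gaussian distributions and the names `balancing_loss` and `true_log_ratio` are hypothetical choices made for the example.

```python
import numpy as np

def balancing_loss(g_p, g_q):
    """Assumed exponential-type loss (not taken from the paper):
    mean exp(-g/2) over the P sample plus mean exp(g/2) over the Q sample."""
    return np.mean(np.exp(-0.5 * g_p)) + np.mean(np.exp(0.5 * g_q))

def true_log_ratio(x):
    # log p(x)/q(x) for P = N(0, 1) and Q = N(1, 1)
    return 0.5 - x

rng = np.random.default_rng(0)
n = 200_000
x_p = rng.normal(0.0, 1.0, n)  # i.i.d. sample from P
x_q = rng.normal(1.0, 1.0, n)  # i.i.d. sample from Q

# Loss at the true log ratio, and at a deliberately shifted (wrong) log ratio.
loss_true = balancing_loss(true_log_ratio(x_p), true_log_ratio(x_q))
loss_shifted = balancing_loss(true_log_ratio(x_p) + 1.0,
                              true_log_ratio(x_q) + 1.0)

# Closed form of the Bhattacharyya coefficient (integral of sqrt(p*q))
# for two unit-variance Gaussians whose means differ by 1.
bhattacharyya = np.exp(-1.0 / 8.0)

# Plug-in estimate of the squared Hellinger distance from the minimized loss.
hellinger_sq_estimate = 1.0 - 0.5 * loss_true

print(loss_true, 2.0 * bhattacharyya, loss_shifted, hellinger_sq_estimate)
```

With a piecewise-constant g, as in a tree, the same assumed loss has a closed-form minimizer on each leaf: the log of the ratio between the fractions of the P and Q samples that fall in that leaf. This is one reason a loss of this kind pairs naturally with stagewise and boosting-style updates.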