We propose an unsupervised tree boosting algorithm that infers the sampling distribution underlying an i.i.d. sample by fitting additive tree ensembles in a manner analogous to supervised tree boosting. Integral to the algorithm is a new notion of "addition" on probability distributions, which leads to a coherent notion of "residualization": subtracting a probability distribution from an observation so as to remove the corresponding distributional structure from the observation's sampling distribution. We show that these notions arise naturally for univariate distributions through cumulative distribution function (CDF) transforms and compositions, owing to several "group-like" properties of univariate CDFs. The traditional multivariate CDF does not preserve these properties, but a new definition of the multivariate CDF restores them, allowing "addition" and "residualization" to be formulated in multivariate settings as well. This gives rise to an unsupervised boosting algorithm based on forward-stagewise fitting of an additive tree ensemble, which sequentially reduces the Kullback-Leibler divergence from the truth. The algorithm allows analytic evaluation of the fitted density and outputs a generative model that can be readily sampled from. We enhance the algorithm with scale-dependent shrinkage and a two-stage strategy that fits the marginals and the copula separately. With these enhancements, the algorithm performs competitively with state-of-the-art deep-learning approaches to multivariate density estimation on multiple benchmark data sets.
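To make the univariate notions concrete, the following is a minimal sketch under stated assumptions rather than the algorithm itself. It assumes distributions supported on [0, 1], reads "addition" of two distributions as composition of their CDFs (apply the first CDF, then the second), and reads "residualization" as applying a CDF to each observation. The CDFs `g1` and `g2` and all helper names are hypothetical choices made only for illustration; no trees are fitted here.

```python
import random

# Two hypothetical CDFs on [0, 1] with closed-form inverses (illustrative only).
def g1(x):
    return x ** 2

def g1_inv(u):
    return u ** 0.5

def g2(x):
    return 1.0 - (1.0 - x) ** 3

def g2_inv(u):
    return 1.0 - (1.0 - u) ** (1.0 / 3.0)

def add_cdf(first, second):
    """Assumed 'addition': the CDF obtained by applying `first`, then `second`."""
    return lambda x: second(first(x))

def sample_sum(rng, n):
    """Inverse-CDF sampling from the distribution whose CDF is g2(g1(x))."""
    return [g1_inv(g2_inv(rng.random())) for _ in range(n)]

def residualize(cdf, values):
    """Assumed 'residualization': transform each observation by the CDF."""
    return [cdf(v) for v in values]

def ks_to_uniform(values):
    """Kolmogorov-Smirnov distance between the empirical CDF and Uniform(0, 1)."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)
xs = sample_sum(rng, 5000)      # observations from the "sum" of g1 and g2
r1 = residualize(g1, xs)        # g1 removed: what remains is distributed as g2
r2 = residualize(g2, r1)        # g2 removed: what remains is roughly uniform

ks_raw = ks_to_uniform(xs)
ks_r1 = ks_to_uniform(r1)
ks_r2 = ks_to_uniform(r2)
print("KS distance to uniform, raw sample:      ", round(ks_raw, 3))
print("KS distance to uniform, after removing g1:", round(ks_r1, 3))
print("KS distance to uniform, after removing g2:", round(ks_r2, 3))
```

Under these assumptions, residualizing by each component in turn leaves a sample that is close to uniform, which mirrors how a boosting procedure could strip away structure one ensemble member at a time. Note that the intermediate residuals need not look more uniform than the raw sample under this particular distance; the sketch only illustrates that once every component has been removed, no structure remains.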