Despite the dominance of deep learning in unstructured data domains, tree-based methods such as Random Forests (RF) and Gradient Boosted Decision Trees (GBDT) remain the workhorses for discriminative tasks on tabular data. We explore generative extensions of these popular algorithms, focusing on explicitly modeling the data density (up to a normalization constant), which enables applications beyond sampling. As our main contribution, we propose an energy-based generative boosting algorithm analogous to the second-order boosting implemented in popular packages such as XGBoost. We show that, despite producing a generative model capable of handling inference tasks over any input variable, our proposed algorithm achieves discriminative performance similar to GBDT on a number of real-world tabular datasets, outperforming alternative generative approaches. At the same time, we show that it is also competitive with neural-network-based models for sampling.