Inference from tabular data, collections of continuous and categorical variables organized into matrices, is a foundation of modern technology and science. Yet, in contrast to the explosive changes in the rest of AI, the best practice for these predictive tasks has remained relatively unchanged and is still primarily based on variations of Gradient Boosted Decision Trees (GBDTs). Very recently, there has been renewed interest in developing state-of-the-art methods for tabular data based on recent developments in neural networks and feature learning methods. In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to both adapt to the local structure of the data and scale to essentially unlimited amounts of training data. We show that, compared to $31$ other methods, including recently introduced tabular foundation models (TabPFNv2) and GBDTs, xRFM achieves the best performance across $100$ regression datasets and is competitive with the best methods across $200$ classification datasets, outperforming GBDTs. Additionally, xRFM provides interpretability natively through the Average Gradient Outer Product.
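To make the interpretability claim concrete, the Average Gradient Outer Product (AGOP) of a scalar-valued predictor $f$ over samples $x_1, \dots, x_n$ is the matrix $\frac{1}{n}\sum_{i=1}^{n} \nabla f(x_i)\,\nabla f(x_i)^\top$, whose diagonal can be read as per-feature importance. The following is a minimal sketch of that definition only, not the xRFM implementation: the names `agop` and `predict` are illustrative assumptions, gradients are approximated by central finite differences, and the predictor is a toy linear function rather than a trained kernel machine.

```python
import numpy as np

def agop(predict, X, eps=1e-5):
    """Average gradient outer product of a scalar-valued `predict` over the rows of X.

    Hypothetical helper for illustration; gradients are estimated
    with central finite differences, one coordinate at a time.
    """
    n, d = X.shape
    G = np.zeros((n, d))  # G[i, j] approximates the j-th partial derivative at X[i]
    for j in range(d):
        step = np.zeros(d)
        step[j] = eps
        G[:, j] = (predict(X + step) - predict(X - step)) / (2 * eps)
    return G.T @ G / n  # (d, d) average of the outer products grad * grad^T

# Toy predictor (an assumption for this sketch): the gradient is (2, 0, -1) everywhere.
def toy_predict(Z):
    return 2.0 * Z[:, 0] + 0.0 * Z[:, 1] - 1.0 * Z[:, 2]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

M = agop(toy_predict, X)
importance = np.diag(M)  # the unused second feature gets (near) zero importance
```

In this toy case the second feature does not affect the output, so its diagonal entry in `M` is zero up to rounding, while the off-diagonal entries record how the gradient directions of the first and third features co-vary.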