Inference from tabular data, collections of continuous and categorical variables organized into matrices, is a foundation of modern technology and science. Yet, in contrast to the explosive changes in the rest of AI, best practice for these predictive tasks has remained relatively unchanged and is still primarily based on variations of Gradient Boosted Decision Trees (GBDTs). Very recently, there has been renewed interest in developing state-of-the-art methods for tabular data based on recent developments in neural networks and feature learning methods. In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to both adapt to the local structure of the data and scale to essentially unlimited amounts of training data. We show that, compared to $31$ other methods, including recently introduced tabular foundation models (TabPFNv2) and GBDTs, xRFM achieves the best performance across $100$ regression datasets and is competitive with the best methods across $200$ classification datasets, outperforming GBDTs. Additionally, xRFM provides interpretability natively through the Average Gradient Outer Product.
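The Average Gradient Outer Product mentioned above is the matrix $\frac{1}{n}\sum_{i=1}^{n} \nabla f(x_i)\,\nabla f(x_i)^\top$ for a predictor $f$ over training points $x_1,\dots,x_n$; its dominant eigenvectors indicate which feature directions the predictor actually uses. A minimal sketch, using finite-difference gradients and a toy predictor (both illustrative assumptions, not code from the paper):

```python
import numpy as np

def agop(f, X, eps=1e-5):
    """Estimate the Average Gradient Outer Product of f over the rows of X.

    Gradients are approximated by central finite differences; for a
    differentiable model one would use its analytic or autodiff gradient.
    """
    n, d = X.shape
    G = np.zeros((d, d))
    for x in X:
        grad = np.zeros(d)
        for j in range(d):
            e = np.zeros(d)
            e[j] = eps
            grad[j] = (f(x + e) - f(x - e)) / (2 * eps)
        G += np.outer(grad, grad)
    return G / n

# Toy predictor that depends only on the first coordinate.
f = lambda x: np.sin(x[0])
X = np.random.default_rng(0).normal(size=(200, 3))
M = agop(f, X)
# The mass of M concentrates on the (0, 0) entry, revealing that the
# predictor uses only the first feature.
```

The eigendecomposition of the AGOP matrix then gives a native feature-importance and feature-direction summary, which is the sense in which the abstract calls xRFM interpretable.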