Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient-boosted decision trees (GBDT) and neural networks. Accuracy aside, a major advantage of neural models is that they learn reusable features and are easily fine-tuned in new domains. This property is often exploited in computer vision and natural language applications, where transfer learning is indispensable when task-specific training data is scarce. In this work, we demonstrate that upstream data gives tabular neural networks a decisive advantage over widely used GBDT models. We propose a realistic medical diagnosis benchmark for tabular transfer learning, and we present a how-to guide for using upstream data to boost performance with a variety of tabular neural network architectures. Finally, we propose a pseudo-feature method for cases where the upstream and downstream feature sets differ, a tabular-specific problem widespread in real-world applications. Our code is available at https://github.com/LevinRoman/tabular-transfer-learning .