While deep learning has demonstrated impressive results across many data modalities, it continues to lag behind tree-based methods on tabular data, often referred to as the last "unconquered castle" for neural networks. We hypothesize that a key advantage of tree-based methods lies in their intrinsic ability to model and exploit the non-linear interactions induced by features with categorical characteristics. Neural methods, in contrast, are biased toward uniform numerical processing of features and toward smooth solutions, which makes it difficult for them to leverage such patterns effectively. We address this performance gap with statistics-based feature preprocessing that identifies features which become strongly correlated with the target once discretized. We further mitigate the bias of deep models toward overly smooth solutions, a bias that is misaligned with the inherent properties of the data, using learned Fourier features. We show that the proposed feature preprocessing significantly boosts the performance of deep learning models, enabling them to closely match or surpass XGBoost on a comprehensive tabular data benchmark.
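To make the Fourier idea concrete, below is a minimal sketch (not the paper's exact method) of embedding a numeric tabular feature with sin/cos components at a set of frequencies. In a real model the frequencies would be trainable parameters updated by gradient descent; here they are fixed for illustration, and all names (`fourier_features`, `freqs`) are placeholders.

```python
import numpy as np

def fourier_features(x, freqs):
    """Embed a scalar feature column with sin/cos at the given frequencies.

    x:     (n,) array of raw numeric feature values
    freqs: (k,) array of frequencies (learnable in a real model; fixed here)
    Returns an (n, 2k) embedding that lets a downstream network represent
    sharp, high-frequency structure that a smooth MLP struggles to fit.
    """
    phases = 2.0 * np.pi * np.outer(x, freqs)  # (n, k) matrix of phases
    return np.concatenate([np.sin(phases), np.cos(phases)], axis=1)

# Example: embed an integer-valued feature with two frequencies.
x = np.arange(8, dtype=float)
emb = fourier_features(x, freqs=np.array([0.5, 1.0]))
print(emb.shape)  # (8, 4)
```

The sin/cos pair at each frequency makes the embedding shift-equivariant in the raw feature, so periodic or threshold-like target patterns become linearly accessible to the layers above.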