Deep learning methods have demonstrated outstanding performance on classification and regression tasks involving homogeneous data types (e.g., image, audio, and text data). However, tabular data still poses a challenge, with classic machine learning approaches often being computationally cheaper than, and as effective as, increasingly complex deep learning architectures. The challenge arises from the fact that, in tabular data, the correlation among features is weaker than that arising from spatial or semantic relationships in images or natural language, and the dependency structures need to be modeled without any prior information. In this work, we propose a novel deep learning architecture that exploits the structural organization of the data through topologically constrained network representations to gain spatial information from sparse tabular data. The resulting model leverages the power of convolutions and is centered on a limited number of concepts from network topology to guarantee (i) a data-centric, deterministic building pipeline; (ii) a high level of interpretability over the inference process; and (iii) adequate room for scalability. We test our model on 18 benchmark datasets against 5 classic machine learning and 3 deep learning models, demonstrating that our approach reaches state-of-the-art performance on these challenging datasets. The code to reproduce all our experiments is provided at https://github.com/FinancialComputingUCL/HomologicalCNN.