Tabular machine learning is an important field for industry and science. In this field, table rows are usually treated as independent data samples, but additional information about relations between them is sometimes available and can be used to improve predictive performance. Such information can be naturally modeled with a graph, so tabular machine learning may benefit from graph machine learning methods. However, graph machine learning models are typically evaluated on datasets with homogeneous node features, which have little in common with the heterogeneous mixtures of numerical and categorical features present in tabular datasets. This critical difference between the data used in tabular and graph machine learning studies leaves it unclear how successfully graph models can be transferred to tabular data. To bridge this gap, we propose a new benchmark of diverse graphs with heterogeneous tabular node features and realistic prediction tasks. We use this benchmark to evaluate a large set of models, including simple methods previously overlooked in the literature. Our experiments show that graph neural networks (GNNs) can indeed often bring gains in predictive performance for tabular data, but standard tabular models can also be adapted to work with graph data through simple feature preprocessing, which sometimes enables them to compete with and even outperform GNNs. Based on our empirical study, we provide insights for researchers and practitioners in both the tabular and graph machine learning fields.
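The abstract does not say what the simple feature preprocessing consists of. As a rough illustration only, one plausible form is to append to each row the mean of its graph neighbors' numerical features, so that a standard tabular model sees some neighborhood information. The sketch below assumes this form; the function name `augment_with_neighbor_means`, the undirected edge list, and the toy data are all hypothetical and are not taken from the benchmark.

```python
import numpy as np


def augment_with_neighbor_means(features, edges):
    """Append the mean of each node's neighbor features to its own features.

    Assumed setup (not from the source): `features` is an
    (n_nodes, n_features) array of numerical features, and `edges` is an
    iterable of (u, v) index pairs treated as undirected.
    """
    features = np.asarray(features, dtype=float)
    n_nodes = features.shape[0]
    neighbor_sum = np.zeros_like(features)
    degree = np.zeros(n_nodes)
    for u, v in edges:
        # Each undirected edge contributes to both endpoints.
        neighbor_sum[u] += features[v]
        neighbor_sum[v] += features[u]
        degree[u] += 1
        degree[v] += 1
    # Isolated nodes keep a zero vector as their neighbor mean.
    neighbor_mean = neighbor_sum / np.maximum(degree, 1)[:, None]
    return np.hstack([features, neighbor_mean])


# Toy example: a path graph 0 - 1 - 2 with one numerical feature per row.
X = np.array([[1.0], [2.0], [6.0]])
edges = [(0, 1), (1, 2)]
X_aug = augment_with_neighbor_means(X, edges)
```

The augmented matrix `X_aug` could then be passed to any tabular model in place of `X`. Categorical features would need their own treatment (for example, one-hot encoding before averaging), which this sketch leaves out.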