Much of the world's most valued data is stored in relational databases and data warehouses, where the data is organized into many tables connected by primary-foreign key relations. However, building machine learning models using this data is both challenging and time-consuming. The core problem is that no machine learning method is capable of learning on multiple tables interconnected by primary-foreign key relations. Current methods can only learn from a single table, so the data must first be manually joined and aggregated into a single training table, a process known as feature engineering. Feature engineering is slow and error-prone, and it leads to suboptimal models. Here we introduce an end-to-end deep representation learning approach to directly learn on data laid out across multiple tables. We name our approach Relational Deep Learning (RDL). The core idea is to view a relational database as a temporal, heterogeneous graph, with a node for each row in each table, and edges specified by primary-foreign key links. Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all input data, without any manual feature engineering. Relational Deep Learning leads to more accurate models that can be built much faster. To facilitate research in this area, we develop RelBench, a set of benchmark datasets and an implementation of Relational Deep Learning. The data covers a wide spectrum, from discussions on Stack Exchange to book reviews on the Amazon Product Catalog. Overall, we define a new research area that generalizes graph machine learning and broadens its applicability to a wide set of AI use cases.
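
To make the graph view concrete, the following is a minimal sketch in plain Python, not the RelBench implementation. The toy `customers` and `orders` tables, the `TABLES` layout, and the helpers `build_graph` and `mean_order_amount` are all hypothetical names introduced for illustration. The sketch creates one node per row and one edge per primary-foreign key link, then applies a single fixed mean aggregation over a node's neighbors. That fixed aggregation only stands in for a learned message-passing layer, and the `before_time` filter is an assumed way of respecting the temporal nature of the graph.

```python
from collections import defaultdict

# Hypothetical toy database: two tables linked by a primary-foreign key relation.
TABLES = {
    "customers": {
        "pk": "customer_id",
        "fks": {},
        "rows": [
            {"customer_id": 1, "name": "Ada"},
            {"customer_id": 2, "name": "Bo"},
        ],
    },
    "orders": {
        "pk": "order_id",
        "fks": {"customer_id": "customers"},  # foreign key column -> referenced table
        "rows": [
            {"order_id": 10, "customer_id": 1, "amount": 30.0, "time": 1},
            {"order_id": 11, "customer_id": 1, "amount": 50.0, "time": 2},
            {"order_id": 12, "customer_id": 2, "amount": 20.0, "time": 3},
        ],
    },
}


def build_graph(tables):
    """Return (nodes, edges): one node per row, one edge per primary-foreign key link."""
    nodes = {}  # (table name, primary key value) -> row dict
    edges = defaultdict(list)  # (src table, fk column, dst table) -> [(src node, dst node)]
    for name, table in tables.items():
        for row in table["rows"]:
            nodes[(name, row[table["pk"]])] = row
    for name, table in tables.items():
        for row in table["rows"]:
            src = (name, row[table["pk"]])
            for fk_col, dst_table in table["fks"].items():
                dst = (dst_table, row[fk_col])
                if dst in nodes:  # skip dangling references
                    edges[(name, fk_col, dst_table)].append((src, dst))
    return nodes, dict(edges)


def mean_order_amount(nodes, edges, customer, before_time):
    """Fixed (not learned) aggregation: mean 'amount' of linked orders earlier than before_time."""
    amounts = [
        nodes[src]["amount"]
        for src, dst in edges[("orders", "customer_id", "customers")]
        if dst == customer and nodes[src]["time"] < before_time
    ]
    return sum(amounts) / len(amounts) if amounts else 0.0


nodes, edges = build_graph(TABLES)
print(len(nodes))  # 5 nodes: 2 customers + 3 orders
print(mean_order_amount(nodes, edges, ("customers", 1), before_time=3))  # 40.0
```

Edges are grouped by `(source table, foreign key column, destination table)` so that each relation type stays separate, which is what makes the graph heterogeneous. In a learned model, the hand-written mean would be replaced by trainable message and update functions applied per edge type.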