Relational databases (RDBs) underpin the majority of global data management systems, where information is structured into multiple interdependent tables. To effectively use the knowledge within RDBs for predictive tasks, recent advances leverage graph representation learning to capture complex inter-table relations as multi-hop dependencies. Despite achieving state-of-the-art performance, these methods remain hindered by prohibitive storage overhead and excessive training time, due to the massive scale of the database and the computational burden of intensive message passing across interconnected tables. To alleviate these concerns, we propose and study the problem of Relational Database Distillation (RDD). Specifically, we aim to distill large-scale RDBs into compact heterogeneous graphs while retaining the predictive power (i.e., utility) required for training graph-based models. Multi-modal column information is preserved through node features, and primary-foreign key relations are encoded via heterogeneous edges, thereby maintaining both data fidelity and relational structure. To ensure adaptability across diverse downstream tasks without resorting to the traditional, inefficient bi-level distillation framework, we further design a kernel ridge regression-guided objective with pseudo-labels, which produces high-quality features for the distilled graph. Extensive experiments on multiple real-world RDBs demonstrate that our solution substantially reduces the data size while maintaining competitive performance on classification and regression tasks, creating an effective pathway for scalable learning with RDBs.
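To make the kernel ridge regression-guided objective concrete, the following is a minimal, hedged sketch of the general idea: fit KRR in closed form on the distilled (synthetic) features and measure how well it predicts pseudo-labels on the real data, yielding a differentiable-style matching loss without bi-level optimization. All function names, the RBF kernel choice, and the hyperparameters here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF (Gaussian) kernel between rows of A (n, d) and B (m, d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_distillation_loss(X_syn, y_syn, X_real, y_pseudo, lam=1e-3, gamma=1.0):
    """Closed-form KRR fit on the distilled set, evaluated against
    pseudo-labels of the real set (illustrative sketch, not the paper's code)."""
    # Solve (K_ss + lam * I) alpha = y_syn for the dual coefficients.
    K_ss = rbf_kernel(X_syn, X_syn, gamma)
    alpha = np.linalg.solve(K_ss + lam * np.eye(len(X_syn)), y_syn)
    # Predict pseudo-labels of the real data from the distilled set.
    K_rs = rbf_kernel(X_real, X_syn, gamma)
    preds = K_rs @ alpha
    # Mean squared error drives the distilled features toward utility.
    return float(((preds - y_pseudo) ** 2).mean())
```

In an actual distillation loop, `X_syn` would be the learnable node features of the compact heterogeneous graph, updated by minimizing this loss, while `y_pseudo` comes from a teacher model on the full RDB.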