Relational Foundation Models (RFMs) facilitate data-driven decision-making by learning from complex multi-table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints. While methods exist to generate synthetic tabular data of arbitrary size, incorporating schema structure and primary--foreign key connectivity for multi-table generation remains challenging. Here we introduce PluRel, a framework for synthesizing multi-table relational databases from scratch. PluRel proceeds in stages, modeling (1) schemas as directed graphs, (2) inter-table primary--foreign key connectivity as bipartite graphs, and (3) feature distributions within tables via conditional causal mechanisms. The design space spanned by these stages supports the synthesis of diverse databases while keeping generation computationally lightweight. Using PluRel, we observe for the first time that (1) RFM pretraining loss exhibits power-law scaling with the number of synthetic databases and the total number of pretraining tokens, (2) scaling the number of synthetic databases improves generalization to real databases, and (3) synthetic pretraining yields strong base models for continued pretraining on real databases. Overall, our framework and results position synthetic data scaling as a promising paradigm for RFMs.
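To make the three stages concrete, the following is a minimal, self-contained Python sketch of how they compose when synthesizing a single database. It is an illustrative assumption rather than the PluRel implementation: the function names (sample_schema, sample_fk_links, sample_child_features), the tree-shaped schema, the many-to-one key links, and the linear-Gaussian feature mechanism are all hypothetical simplifications.

```python
# Illustrative sketch only (not the PluRel implementation): the three stages of
# synthesizing one relational database, with hypothetical names and mechanisms.
import numpy as np

rng = np.random.default_rng(0)

def sample_schema(n_tables):
    # Stage 1: schema as a directed acyclic graph; here each non-root table gets
    # one random parent, and edge (parent, child) means "child holds a foreign
    # key into parent".
    return [(int(rng.integers(0, j)), j) for j in range(1, n_tables)]

def sample_fk_links(n_parent_rows, n_child_rows):
    # Stage 2: bipartite primary-foreign key connectivity; every child row is
    # linked to exactly one parent row (many-to-one).
    return rng.integers(0, n_parent_rows, size=n_child_rows)

def sample_child_features(parent_feats, fk, n_feats=4, noise=0.1):
    # Stage 3: a simple conditional causal mechanism; child features are a random
    # linear function of the linked parent row's features plus Gaussian noise.
    w = rng.normal(size=(parent_feats.shape[1], n_feats))
    return parent_feats[fk] @ w + noise * rng.normal(size=(len(fk), n_feats))

# Tiny end-to-end example: one synthetic database with three tables.
n_rows = {0: 50, 1: 200, 2: 400}
schema = sample_schema(3)
tables = {0: rng.normal(size=(n_rows[0], 4))}  # root table: unconditional features
fks = {}
for parent, child in schema:
    fks[child] = sample_fk_links(n_rows[parent], n_rows[child])
    tables[child] = sample_child_features(tables[parent], fks[child])
```

The actual design space is much richer (general schema graphs, varying link multiplicities, and diverse feature mechanisms); the sketch only shows how the three stages chain together to yield an arbitrarily large multi-table database.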