Relational databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those for text or vision. A key obstacle is that high-quality RDBs are private, scarce, and structurally heterogeneous, making internet-scale pre-training infeasible. To overcome this data scarcity, we introduce $\textbf{RDB-PFN}$, the first relational foundation model trained purely on $\textbf{synthetic data}$. Inspired by Prior-Data Fitted Networks (PFNs), in which synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a $\textbf{Relational Prior Generator}$ that creates an infinite stream of diverse RDBs from scratch. Pre-trained on $\textbf{over 2 million}$ synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database instantly via genuine $\textbf{in-context learning}$. Experiments show that RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming graph-based and single-table foundation-model baselines (given the same DFS-linearized inputs) while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN