No Need to Train Your RDB Foundation Model

Relational databases (RDBs) contain vast amounts of heterogeneous tabular information that can be exploited for predictive modeling. But because the space of potential prediction targets in enterprise settings is so large, how can we \textit{avoid retraining} a new model each time we wish to predict a new quantity of interest? Foundation models based on in-context learning (ICL) offer a convenient option, but so far they are largely restricted to single-table operation. To generalize to multiple interrelated tables, it is essential to compress variably-sized RDB neighborhoods into fixed-length ICL samples that the decoder can consume. However, the details here are critical: we provide theoretical and empirical evidence that, unlike in existing supervised learning RDB pipelines, ICL-specific compression should be constrained \textit{within} high-dimensional RDB columns, where all entities share units and roles, and not applied \textit{across} columns, where the relevance of heterogeneous data types cannot be determined without label information. Under this restriction, we then demonstrate that excluding trainable parameters does not compromise encoder expressiveness. Hence we arrive at a principled family of RDB encoders that can be seamlessly paired with existing single-table ICL foundation models, with no training or fine-tuning required. From a practical standpoint, we develop scalable SQL primitives to implement the encoder stage, resulting in an easy-to-use open-source RDB foundation model\footnote{\label{foot: RDBLearn_learn} https://github.com/HKUSHXLab/rdblearn} capable of robust performance on unseen datasets out of the box.
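
To make the idea of within-column compression concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical toy schema (`users`, `orders`) and a hypothetical choice of aggregates (count, mean, min, max), and it uses only Python's standard `sqlite3` module instead of the rdblearn API. Each user's variably-sized set of orders is reduced to a fixed-length row by aggregating each column separately, with no trainable parameters and no mixing across columns.

```python
import sqlite3


def build_toy_db():
    """Create a hypothetical two-table RDB in memory (illustrative schema only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (user_id INTEGER PRIMARY KEY);
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            user_id  INTEGER REFERENCES users(user_id),
            amount   REAL,
            quantity INTEGER
        );
        INSERT INTO users VALUES (1), (2), (3);
        INSERT INTO orders VALUES
            (10, 1, 20.0, 1),
            (11, 1, 40.0, 3),
            (12, 2, 5.0, 2);
    """)
    return conn


def encode_neighborhoods(conn):
    """Return one fixed-length feature row per user.

    Every aggregate is computed within a single column (amount or quantity),
    so values with different units are never combined. The aggregate set is
    an assumption made for illustration.
    """
    query = """
        SELECT u.user_id,
               COUNT(o.order_id) AS n_orders,
               AVG(o.amount)     AS amount_mean,
               MIN(o.amount)     AS amount_min,
               MAX(o.amount)     AS amount_max,
               AVG(o.quantity)   AS quantity_mean
        FROM users AS u
        LEFT JOIN orders AS o ON o.user_id = u.user_id
        GROUP BY u.user_id
        ORDER BY u.user_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_toy_db()
    for row in encode_neighborhoods(conn):
        print(row)
```

In this sketch, users with two, one, and zero orders all map to rows of the same length, which is the shape a single-table ICL model expects as input; a user with no orders simply receives a zero count and NULL aggregates.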
