Fine-tuning tabular foundation models (TFMs) under data scarcity is challenging, as early stopping on even scarcer validation data often fails to capture true generalization performance. We propose CausalMixFT, a method that enhances fine-tuning robustness and downstream performance by generating structurally consistent synthetic samples using Structural Causal Models (SCMs) fitted on the target dataset. This approach augments limited real data with causally informed synthetic examples, preserving feature dependencies while expanding training diversity. Evaluated across 33 classification datasets from TabArena and over 2300 fine-tuning runs, our CausalMixFT method consistently improves median normalized ROC-AUC from 0.10 (standard fine-tuning) to 0.12, outperforming purely statistical generators such as CTGAN (-0.01), TabEBM (-0.04), and TableAugment (-0.09). Moreover, it narrows the median validation-test performance correlation gap from 0.67 to 0.30, enabling more reliable validation-based early stopping, a key step toward improving fine-tuning stability under data scarcity. These results demonstrate that incorporating causal structure into data augmentation provides an effective and principled route to fine-tuning tabular foundation models in low-data regimes.
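To make the augmentation idea concrete, here is a minimal sketch of SCM-based synthetic sample generation. This is an illustrative assumption, not the paper's actual CausalMixFT implementation: it presumes a known DAG over the features and linear-Gaussian mechanisms, fits one regression per node on its parents, and samples synthetic rows in topological order so that feature dependencies are preserved.

```python
import numpy as np

def topo_order(parents):
    """Return a topological ordering of nodes given per-node parent lists (assumes a DAG)."""
    d = len(parents)
    remaining = set(range(d))
    order = []
    while remaining:
        for j in sorted(remaining):
            if all(p not in remaining for p in parents[j]):
                order.append(j)
                remaining.remove(j)
                break
    return order

def fit_scm(X, parents):
    """Fit a linear-Gaussian mechanism for each column, regressing it on its parents."""
    n, d = X.shape
    mech = {}
    for j in range(d):
        pa = parents[j]
        if pa:
            A = np.column_stack([X[:, pa], np.ones(n)])
            w, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
            resid = X[:, j] - A @ w
        else:
            w = np.array([X[:, j].mean()])  # root node: just its mean
            resid = X[:, j] - w[0]
        mech[j] = (w, resid.std())
    return mech

def sample_scm(mech, parents, n, rng):
    """Sample n synthetic rows by evaluating mechanisms in topological order."""
    d = len(parents)
    S = np.zeros((n, d))
    for j in topo_order(parents):
        w, sigma = mech[j]
        pa = parents[j]
        mean = np.column_stack([S[:, pa], np.ones(n)]) @ w if pa else w[0]
        S[:, j] = mean + rng.normal(0.0, sigma, size=n)
    return S
```

In this sketch, the scarce real training set would then be augmented by stacking real and synthetic rows, e.g. `X_aug = np.vstack([X_real, sample_scm(mech, parents, n_synth, rng)])`. The hypothetical names (`fit_scm`, `sample_scm`, `parents`) and the linear-Gaussian assumption are simplifications; the paper's SCMs fitted on the target dataset may use richer mechanisms and learned structure.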