Tabular Foundation Models (TFMs) achieve state-of-the-art zero-shot accuracy on small tabular datasets by meta-learning over synthetic data-generating processes, which makes them highly attractive for practitioners who cannot afford large annotated corpora. However, their in-context learning mechanism assumes approximately clean inputs: missing values, outliers, and duplicates in real-world data create a prior mismatch that degrades both accuracy and confidence calibration. Correcting this mismatch requires sequential decisions over cleaning operators whose interactions no static preprocessing rule can anticipate, a natural fit for reinforcement learning~(RL). We introduce L2C2, the first deep RL framework to frame tabular data cleaning as prior alignment: a learned policy sequences operators to minimize the distributional gap between the dirty input and the TFM's synthetic prior. Six experiments on ten OpenML benchmark datasets show that 1) three of seven reward designs collapse to degenerate, trivial cleaning strategies, so principled reward engineering is non-trivial; 2) our proposed TFMAwareReward selects structurally distinct pipelines on 4/10 datasets and achieves higher TabPFN accuracy on those diverging cases (mean 0.851 vs. 0.843; Wilcoxon p=0.063, n=4) while never underperforming; 3) parameterized cleaning actions improve the best-found pipeline reward on 9/10 datasets (Wilcoxon p=0.004); and 4) a policy pre-trained on a single source dataset exceeds training from scratch at the 2,000-step fine-tuning checkpoint on all three held-out datasets (up to +28.8% after full fine-tuning), demonstrating cross-dataset transfer of prior-alignment knowledge. These findings establish that prior alignment is a principled data preparation strategy for TFM deployment on real-world tabular data.
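For readers weighing the reported Wilcoxon p=0.063 at n=4, the following minimal sketch may help. It computes an exact one-sided Wilcoxon signed-rank p-value by enumerating all sign assignments. With four paired differences that all share the same sign, the smallest attainable one-sided p-value is 1/16 = 0.0625, which rounds to 0.063. Whether the reported test was one-sided and exact is an assumption here, not something the abstract states. The function name `exact_wilcoxon_one_sided` and the `gains` values are hypothetical placeholders, not the paper's code or data.

```python
from itertools import product


def exact_wilcoxon_one_sided(diffs):
    """Exact one-sided p-value (H1: median difference > 0) for the
    Wilcoxon signed-rank test.

    Assumes no zero differences and no ties in absolute value,
    so the ranks are simply 1..n.
    """
    n = len(diffs)
    # Rank the absolute differences (1 = smallest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    # Observed statistic: sum of the ranks of the positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 every sign pattern is equally likely: enumerate all 2^n.
    count = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if w >= w_plus:
            count += 1
    return count / 2 ** n


# Hypothetical per-dataset accuracy gains (illustrative, not the paper's numbers).
gains = [0.004, 0.011, 0.006, 0.009]
p = exact_wilcoxon_one_sided(gains)
print(round(p, 4))  # 0.0625, the floor for n=4 under a one-sided exact test
```

Under these assumptions, a sample of four pairs cannot reach p<0.05 with this test regardless of effect size, so the n=4 result is best read as directional evidence limited by sample size.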