Tabular prediction has traditionally relied on gradient-boosted decision trees and specialized deep learning models, which excel within tasks but provide limited interpretability and weak transfer across tables. Reasoning large language models (LLMs) promise cross-task adaptability with transparent reasoning traces, yet their potential has not been fully realized for tabular data. This paper presents TabR1, the first reasoning LLM for tabular prediction with multi-step reasoning. At its core is Permutation Relative Policy Optimization (PRPO), a simple yet efficient reinforcement learning method that encodes column-permutation invariance as a structural prior. By constructing multiple label-preserving permutations per sample and estimating advantages both within and across permutations, PRPO transforms sparse rewards into dense learning signals and improves generalization. With limited supervision, PRPO activates the reasoning ability of LLMs for tabular prediction, enhancing few-shot and zero-shot performance as well as interpretability. Comprehensive experiments demonstrate that TabR1 achieves performance comparable to strong baselines under full-supervision fine-tuning. In the zero-shot setting, TabR1 approaches the performance of strong baselines under the 32-shot setting. Moreover, TabR1 (8B) substantially outperforms much larger LLMs across various tasks, achieving up to 53.17% improvement over DeepSeek-R1 (685B).
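The abstract's description of PRPO — multiple label-preserving column permutations per sample, with advantages estimated both within and across permutations — can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's actual formulation: the function name `prpo_advantages`, the reward-tensor shapes, and the equal-weight blend of the two normalizations are all hypothetical choices made here for clarity.

```python
import numpy as np

def prpo_advantages(rewards, eps=1e-8):
    """Illustrative PRPO-style advantage estimation (hypothetical form).

    rewards: array of shape (K, G), where K is the number of label-preserving
    column permutations of one tabular sample and G is the number of sampled
    model responses per permutation.

    Returns advantages of the same shape, blending a within-permutation
    normalization (each permutation's group of responses) with a
    cross-permutation normalization (all K*G rewards for the sample),
    turning a sparse scalar reward into a denser relative signal.
    """
    r = np.asarray(rewards, dtype=float)
    # Within-permutation advantage: center and scale each row (one permutation).
    a_within = (r - r.mean(axis=1, keepdims=True)) / (
        r.std(axis=1, keepdims=True) + eps
    )
    # Cross-permutation advantage: center and scale over the whole sample.
    a_cross = (r - r.mean()) / (r.std() + eps)
    # Equal-weight blend; the actual weighting in PRPO is not specified here.
    return 0.5 * (a_within + a_cross)
```

Because every permutation preserves the label, all K groups share the same correctness criterion, so pooling their rewards across permutations is what densifies the learning signal relative to a single-prompt, single-group baseline.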