When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B parameters) to produce step-by-step reasoning with cell-level citations grounded in table evidence. Phase 1 (SFT) teaches a structured JSON output format from verified reasoning traces. Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alongside citation validity and parsimony. Across six models from two families, Qwen 2.5 (1.5B/3B/7B) and Llama 3 (1B/3B/8B), RSAT improves faithfulness 3.7$\times$ over SFT alone (0.224$\rightarrow$0.826), with near-perfect citation validity (0.992). Post-hoc attribution collapses, with format success below 13%, confirming that attribution must be integrated into reasoning, not retrofitted. Ablations show the faithfulness reward is essential: removing it drops faithfulness from 0.97 to 0.03.
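To make the Phase 2 reward concrete, the following is a minimal sketch of how a composite reward over faithfulness, citation validity, and parsimony might be computed for one structured output. Everything specific here is an assumption for illustration and not the RSAT implementation: the JSON field names (`steps`, `claim`, `cites`), the weights, the per-step citation cap used for parsimony, and the `entails` callable, which stands in for an NLI model that scores how strongly the cited cells support a step.

```python
import json
from typing import Callable, Dict, List

Table = List[List[str]]                  # row-major table; table[r][c] is one cell
Entails = Callable[[str, str], float]    # (premise, hypothesis) -> score in [0, 1]


def _in_bounds(table: Table, r: int, c: int) -> bool:
    return 0 <= r < len(table) and 0 <= c < len(table[r])


def citation_validity(steps: List[Dict], table: Table) -> float:
    """Fraction of cited (row, col) pairs that point at a real cell."""
    cites = [tuple(c) for s in steps for c in s["cites"]]
    if not cites:
        return 0.0
    return sum(_in_bounds(table, r, c) for r, c in cites) / len(cites)


def parsimony(steps: List[Dict], max_cites: int = 3) -> float:
    """Fraction of steps citing at most max_cites cells (hypothetical definition)."""
    if not steps:
        return 0.0
    return sum(len(s["cites"]) <= max_cites for s in steps) / len(steps)


def faithfulness(steps: List[Dict], table: Table, entails: Entails) -> float:
    """Mean entailment score of each step's claim given its validly cited cells."""
    scores = []
    for s in steps:
        cells = [table[r][c] for r, c in s["cites"] if _in_bounds(table, r, c)]
        # A step with no valid evidence cannot be judged faithful.
        scores.append(entails("; ".join(cells), s["claim"]) if cells else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def composite_reward(output: str, table: Table, entails: Entails,
                     weights=(0.6, 0.2, 0.2)) -> float:
    """Weighted sum of the three terms; unparseable output earns zero."""
    try:
        steps = json.loads(output)["steps"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0
    w_f, w_v, w_p = weights  # assumed weights, with faithfulness dominant
    return (w_f * faithfulness(steps, table, entails)
            + w_v * citation_validity(steps, table)
            + w_p * parsimony(steps))


# Toy stand-in for an NLI model: 1.0 only if every cited cell appears in the claim.
def toy_entails(premise: str, hypothesis: str) -> float:
    return float(all(cell in hypothesis for cell in premise.split("; ")))


table = [["team", "wins"], ["A", "10"], ["B", "7"]]
good = json.dumps({"steps": [{"claim": "A has 10 wins", "cites": [[1, 0], [1, 1]]}]})
bad = json.dumps({"steps": [{"claim": "A has 10 wins", "cites": [[5, 5]]}]})
```

In this sketch, a step whose citations point outside the table scores zero on both validity and faithfulness, so it keeps only the parsimony term. Setting the faithfulness weight to zero would leave a reward that checks only the form of the citations, not whether the cited cells support the claim.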