Table Question Answering (TQA) aims to answer natural language questions over structured tables. Large Language Models (LLMs) enable promising solutions to this problem, among which operator-centric approaches that generate table manipulation pipelines in a multi-step manner offer state-of-the-art performance. However, these solutions rely on multiple LLM calls, resulting in prohibitive latencies and computational costs. We propose Operation-R1, the first framework that trains lightweight LLMs (e.g., Qwen-4B/1.7B) via a novel variant of reinforcement learning with verifiable rewards to produce high-quality data-preparation pipelines for TQA in a single inference step. To train such an LLM, we first introduce a self-supervised rewarding mechanism that automatically derives fine-grained, pipeline-wise supervision signals for LLM training. We also propose variance-aware group resampling to mitigate training instability. To further enhance the robustness of pipeline generation, we develop two complementary mechanisms: operation merge, which filters spurious operations through multi-candidate consensus, and adaptive rollback, which offers runtime protection against information loss during data transformation. Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79\% table compression and a 2.2$\times$ reduction in monetary cost.