While Large Language Models (LLMs) have advanced the state-of-the-art in Text-to-SQL, robust reasoning in complex, multi-table environments remains a bottleneck for parameter-efficient models. This paper presents a systematic empirical study on injecting reasoning capabilities into Text-to-SQL through the lens of Reinforcement Learning with Verifiable Rewards (RLVR). We uncover a critical interplay between reward density, advantage scaling, and model capacity. Our analysis yields four primary insights. First, we propose a novel execution-guided dense reward function that significantly outperforms binary signals and existing state-of-the-art rewards by providing granular feedback at the instance level. Second, we analyze the mechanics of advantage calculation, demonstrating that while large models thrive on sparse signals with aggressive advantage scaling, smaller models require dense rewards and conservative scaling to improve Text-to-SQL performance. Third, we evaluate the impact of cold start, showing that distillation does not always improve RLVR performance and that supervised fine-tuned models are prone to distributional mimicry. Fourth, we map the Pareto frontier of training efficiency, providing insights for optimizing Text-to-SQL reasoning under computational constraints. Our findings culminate in the Think2SQL family: our 4B-parameter model demonstrates reasoning capabilities competitive with state-of-the-art models such as o3. We release our models, datasets, and code to create a blueprint for RLVR optimization in Text-to-SQL at https://anonymous.4open.science/r/Think2SQL-3B7F.
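To make the contrast between binary and dense execution-guided rewards concrete: the abstract does not specify the exact reward formula, so the sketch below illustrates one plausible design, in which partial credit is assigned by the overlap between the predicted query's execution result and the gold result, rather than an all-or-nothing execution match. The function name and the Jaccard-overlap choice are illustrative assumptions, not the paper's definition.

```python
# Hypothetical sketch of an execution-guided dense reward (not the paper's
# exact formulation): grades a predicted SQL query by comparing its executed
# result set against the gold result set, giving granular instance-level
# feedback instead of a binary 0/1 execution-match signal.
from typing import List, Tuple

Row = Tuple  # one result row, as returned by a SQL cursor

def dense_execution_reward(pred_rows: List[Row], gold_rows: List[Row]) -> float:
    """Reward in [0, 1]: 1.0 for an exact result-set match, otherwise the
    Jaccard overlap of the two result sets (partial credit for near misses)."""
    pred, gold = set(pred_rows), set(gold_rows)
    if pred == gold:  # covers the empty/empty case as well
        return 1.0
    union = pred | gold
    return len(pred & gold) / len(union)
```

Under this scheme, a query that recovers half of the gold rows earns a reward of 0.5 instead of 0, which is the kind of denser learning signal the abstract argues smaller models need.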