Think2SQL: Reinforce LLM Reasoning Capabilities for Text2SQL

While Large Language Models (LLMs) have advanced the state of the art in Text-to-SQL, robust reasoning in complex, multi-table environments remains a bottleneck for parameter-efficient models. This paper presents a systematic empirical study on injecting reasoning capabilities into Text-to-SQL through the lens of Reinforcement Learning with Verifiable Rewards (RLVR). We uncover a critical interplay between reward density, advantage scaling, and model capacity. Our analysis yields four primary insights. First, we propose a novel execution-guided dense reward function that significantly outperforms both binary signals and existing state-of-the-art rewards by providing granular feedback at the instance level. Second, we analyze the mechanics of advantage calculation, demonstrating that while large models thrive on sparse signals with aggressive advantage scaling, smaller models require dense rewards and conservative scaling to improve Text-to-SQL performance. Third, we evaluate the impact of cold start, showing that distillation does not always improve RLVR performance and that supervised fine-tuned models are prone to distributional mimicry. Fourth, we map the Pareto frontier of training efficiency, providing insights for optimizing Text-to-SQL reasoning under computational constraints. Our findings culminate in the Think2SQL family, whose 4B-parameter model demonstrates reasoning capabilities competitive with state-of-the-art models such as o3. We release our models, datasets, and code as a blueprint for RLVR optimization in Text-to-SQL at https://anonymous.4open.science/r/Think2SQL-3B7F.
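
To make the contrast between sparse and dense execution-based rewards concrete, the following is a minimal sketch and not the reward function proposed in this paper. It assumes a SQLite database, uses a row-level multiset F1 as one possible dense signal, and treats "aggressive" versus "conservative" advantage scaling as dividing (or not dividing) mean-centered rewards by the group standard deviation. All function names (`run_query`, `binary_reward`, `dense_reward`, `group_advantages`) and the `scale_by_std` flag are hypothetical and chosen only for illustration.

```python
import sqlite3
from collections import Counter
from statistics import mean, pstdev


def run_query(conn, sql):
    """Execute sql and return its rows, or None if execution fails."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def binary_reward(conn, pred_sql, gold_sql):
    """Sparse signal: 1.0 only if both result multisets match exactly."""
    pred, gold = run_query(conn, pred_sql), run_query(conn, gold_sql)
    if pred is None or gold is None:
        return 0.0
    return 1.0 if Counter(pred) == Counter(gold) else 0.0


def dense_reward(conn, pred_sql, gold_sql):
    """Dense signal (illustrative assumption): row-level multiset F1
    between the predicted and gold result sets."""
    pred, gold = run_query(conn, pred_sql), run_query(conn, gold_sql)
    if pred is None or gold is None:
        return 0.0
    pred_rows, gold_rows = Counter(pred), Counter(gold)
    if not pred_rows and not gold_rows:
        return 1.0  # both queries return no rows
    overlap = sum((pred_rows & gold_rows).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_rows.values())
    recall = overlap / sum(gold_rows.values())
    return 2 * precision * recall / (precision + recall)


def group_advantages(rewards, scale_by_std=True, eps=1e-6):
    """Group-relative advantages for one prompt's sampled completions.

    scale_by_std=True divides by the group standard deviation, which
    amplifies small reward gaps; False keeps mean-centering only.
    """
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not scale_by_std:
        return centered
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]
```

Under these assumptions, a predicted query that returns two of three gold rows receives partial credit from `dense_reward` but zero from `binary_reward`, so a group of imperfect samples still yields non-zero advantages.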
