Large Language Models (LLMs) have shown impressive capabilities in transforming natural language questions about relational databases into SQL queries. Despite recent improvements, small LLMs struggle to handle questions involving multiple tables and complex SQL patterns under a Zero-Shot Learning (ZSL) setting. Supervised Fine-Tuning (SFT) partially compensates for the knowledge deficits in pretrained models but falls short when dealing with queries that require multi-hop reasoning. To bridge this gap, different LLM training strategies to reinforce reasoning capabilities have been proposed, ranging from leveraging a thinking process within ZSL, to including reasoning traces in SFT, to adopting Reinforcement Learning (RL) strategies. However, the influence of reasoning on Text2SQL performance is still largely unexplored. This paper investigates to what extent LLM reasoning capabilities influence Text2SQL performance on four benchmark datasets. To this end, it considers the following LLM settings: (1) ZSL, with and without general-purpose reasoning; (2) SFT, with and without task-specific reasoning traces; (3) RL, leveraging execution accuracy as the primary reward function; (4) SFT+RL, i.e., a two-stage approach that combines SFT and RL. The results show that general-purpose reasoning under ZSL proves ineffective in tackling complex Text2SQL cases. Small LLMs benefit from SFT with reasoning much more than larger ones, compensating for their (weaker) pretraining. RL is generally beneficial across all tested models and datasets, particularly when SQL queries involve multi-hop reasoning and multiple tables. Small LLMs with SFT+RL excel on the most complex datasets thanks to a strategic balance between the generality of the reasoning process and the optimization of execution accuracy. Thanks to RL, the 7B Qwen-Coder-2.5 model performs on par with 100+ billion-parameter ones on the Bird dataset.
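To make the RL setting concrete, the following is a minimal sketch of an execution-accuracy reward of the kind the abstract describes: the predicted query earns reward 1.0 only if its result set matches that of the gold query on the target database. The function name `execution_accuracy_reward` and the use of SQLite are illustrative assumptions, not the paper's actual implementation; comparing results as sets (ignoring row order and duplicates) is a common simplification.

```python
import sqlite3

def execution_accuracy_reward(pred_sql: str, gold_sql: str, db_path: str) -> float:
    """Illustrative execution-accuracy reward (hypothetical helper, not the
    paper's exact code): 1.0 if the predicted query returns the same result
    set as the gold query, else 0.0."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        cur.execute(gold_sql)
        gold_rows = set(map(tuple, cur.fetchall()))
        cur.execute(pred_sql)
        pred_rows = set(map(tuple, cur.fetchall()))
        return 1.0 if pred_rows == gold_rows else 0.0
    except sqlite3.Error:
        # Invalid or non-executable SQL earns no reward.
        return 0.0
    finally:
        conn.close()
```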