Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases in natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we evaluate six lightweight, industry-oriented test-time scaling strategies across four LLMs, including two reasoning models, on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant to practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs among accuracy, efficiency, and complexity when deploying Text2SQL systems.
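To make the Divide-and-Conquer prompting strategy mentioned above concrete, the sketch below decomposes a natural-language question into sub-questions before composing the final SQL. This is a minimal illustration, not the paper's exact harness: the prompt wording, the two-step structure, and the `llm` helper are all assumptions standing in for whichever chat-completion API is used.

```python
# Minimal sketch of Divide-and-Conquer prompting for Text2SQL.
# `llm` is a hypothetical placeholder for any chat-completion call;
# swap in your provider's client. The prompts are illustrative, not
# the exact templates benchmarked in the paper.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call here")

DECOMPOSE = """Given the database schema and the question, break the
question into numbered sub-questions, each answerable by a simple
SQL fragment.

Schema:
{schema}

Question:
{question}

Sub-questions:"""

COMPOSE = """Using the schema, the question, and the solved
sub-questions below, write one final SQLite query.

Schema:
{schema}

Question:
{question}

Sub-questions and fragments:
{steps}

Final SQL:"""

def divide_and_conquer_sql(schema: str, question: str) -> str:
    # Divide: split the question into smaller sub-problems.
    subqs = llm(DECOMPOSE.format(schema=schema, question=question))
    # Conquer: compose the sub-answers into one final query.
    return llm(COMPOSE.format(schema=schema, question=question, steps=subqs))
```

Note that this pattern doubles the number of model calls per question, which is exactly the kind of latency and token-consumption cost the benchmark reports alongside accuracy.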