While Text-to-SQL systems achieve high accuracy, existing efficiency metrics such as the Valid Efficiency Score prioritize execution time, a metric we show is fundamentally decoupled from consumption-based cloud billing. This paper evaluates cloud query execution cost trade-offs between reasoning and non-reasoning Large Language Models by executing 180 Text-to-SQL queries across six LLMs on Google BigQuery against the 230 GB StackOverflow dataset. Our analysis reveals that reasoning models process 44.5% fewer bytes than their non-reasoning counterparts while maintaining equivalent correctness (96.7% to 100%), and that execution time correlates only weakly with query cost ($r = 0.16$), indicating that speed optimization does not imply cost efficiency. Non-reasoning models also exhibit extreme cost variance of up to 3.4$\times$, producing outliers that scan over 36 GB per query, more than 20$\times$ the best model's 1.8 GB average, due to missing partition filters and inefficient joins. We identify these prevalent inefficiency patterns and provide deployment guidelines that mitigate financial risk in cost-sensitive enterprise environments.