Text-to-SQL and Big Data are both extensively benchmarked fields, yet little research evaluates them jointly. In practice, Text-to-SQL systems are often embedded within Big Data workflows, such as large-scale data processing or interactive data analytics; we refer to this setting as "Text-to-Big SQL". However, existing Text-to-SQL benchmarks remain narrowly scoped and overlook the cost and performance implications that arise at scale. For instance, translation errors that are negligible on small datasets incur substantial cost and latency overheads as data scales, an issue entirely ignored by existing Text-to-SQL metrics. In this paper, we address this overlooked challenge by introducing novel, representative metrics for evaluating Text-to-Big SQL. Our study focuses on production-level LLM agents: database-agnostic systems adaptable to diverse user needs. Through an extensive evaluation of frontier models, we show that standard Text-to-SQL metrics are insufficient for Big Data. In contrast, our proposed Text-to-Big SQL metrics accurately reflect execution efficiency, cost, and the impact of data scale. Furthermore, we provide LLM-specific insights, including fine-grained, cross-model comparisons of latency and cost.