In business, where data-driven decision making is crucial, text-to-SQL is fundamental to giving users natural-language access to structured data. Although recent LLMs perform strongly on code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark designed specifically for real-world business contexts. CORGI consists of synthetic databases inspired by enterprises such as Doordash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. Answering them calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance drops on the higher-level questions: models struggle to make accurate predictions and to offer actionable plans. Measured by execution success rate, CORGI is about 21\% more difficult than the BIRD benchmark. This highlights the gap between the capabilities of popular LLMs and the demands of real-world business intelligence. We release a public dataset, an evaluation framework, and a website for public submissions.