In this work, we examine the fundamental challenges of evaluating Text2SQL solutions, highlighting potential failure causes and the risks of relying on aggregate metrics in existing benchmarks. We identify two largely unaddressed limitations in current open benchmarks: (1) data quality issues in the evaluation data, mainly attributable to the failure to capture the probabilistic nature of translating a natural language description into a structured query (e.g., NL ambiguity), and (2) the bias introduced by using different match functions as approximations for SQL equivalence. To put both limitations into context, we propose a unified taxonomy of Text2SQL limitations that can lead to prediction and evaluation errors. We then motivate the taxonomy with a survey of Text2SQL limitations in state-of-the-art Text2SQL solutions and benchmarks. We describe the causes of each limitation with real-world examples and propose potential mitigation strategies for each category in the taxonomy. We conclude by highlighting the open challenges encountered when deploying such mitigation strategies or attempting to apply the taxonomy automatically.
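To make the second limitation concrete, the following is a minimal sketch (not from the paper) contrasting two common match functions used to approximate SQL equivalence: exact string match versus execution match on a toy SQLite database. The schema, table, and queries are hypothetical illustrations.

```python
import sqlite3

def exact_match(pred: str, gold: str) -> bool:
    # String-level comparison: brittle to harmless syntactic rewrites.
    return pred.strip().lower() == gold.strip().lower()

def execution_match(pred: str, gold: str, conn) -> bool:
    # Result-level comparison: executes both queries and compares row sets.
    rows_pred = set(conn.execute(pred).fetchall())
    rows_gold = set(conn.execute(gold).fetchall())
    return rows_pred == rows_gold

# Toy database (hypothetical schema for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("ann", 100), ("bob", 90)])

gold = "SELECT name FROM emp WHERE salary >= 100"
pred = "SELECT name FROM emp WHERE NOT salary < 100"  # semantically equivalent rewrite

print(exact_match(pred, gold))            # False: surface forms differ
print(execution_match(pred, gold, conn))  # True: identical result sets
```

The two functions disagree on the same prediction, so a benchmark's reported accuracy depends on which approximation it adopts. Note that execution match is itself imperfect: two non-equivalent queries can coincidentally return the same rows on a particular database instance.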