Text-to-SQL models have improved significantly with the adoption of Large Language Models (LLMs), leading to their increasing use in real-world applications. Although many benchmarks exist for evaluating the performance of text-to-SQL models, they often rely on a single aggregate score, lack evaluation under realistic settings, and provide limited insight into model behaviour across different query types. In this work, we present SQLyzr, a comprehensive benchmark and evaluation platform for text-to-SQL models. SQLyzr incorporates a diverse set of evaluation metrics that capture multiple aspects of generated queries, and it enables more realistic evaluation through database scaling and workload alignment with real-world SQL usage patterns. It further supports fine-grained query classification, error analysis, and workload augmentation, allowing users to better diagnose and improve text-to-SQL models. This demonstration showcases these capabilities through an interactive experience. Through SQLyzr's graphical interface, users can customize evaluation settings, analyze fine-grained reports, and explore additional features of the platform. We envision SQLyzr facilitating the evaluation and iterative improvement of text-to-SQL models by addressing key limitations of existing benchmarks. The source code of SQLyzr is available at https://github.com/sepideh-abedini/SQLyzr.