Text-to-SQL models have improved significantly with the adoption of Large Language Models (LLMs), leading to their increasing use in real-world applications. Although many benchmarks exist for evaluating text-to-SQL models, they often rely on a single aggregate score, do not evaluate models under realistic settings, and provide limited insight into model behaviour across different query types. In this work, we present SQLyzr, a comprehensive benchmark and evaluation platform for text-to-SQL models. SQLyzr incorporates a diverse set of evaluation metrics that capture multiple aspects of generated queries, and it enables more realistic evaluation through database scaling and workload alignment with real-world SQL usage patterns. It further supports fine-grained query classification, error analysis, and workload augmentation, allowing users to better diagnose and improve text-to-SQL models. This demonstration showcases these capabilities interactively: through SQLyzr's graphical interface, users can customize evaluation settings, analyze fine-grained reports, and explore additional features of the platform. We expect SQLyzr to facilitate the evaluation and iterative improvement of text-to-SQL models by addressing key limitations of existing benchmarks. The source code of SQLyzr is available at https://github.com/sepideh-abedini/SQLyzr.