Text-to-SQL models have significantly improved with the adoption of Large Language Models (LLMs), leading to their increasing use in real-world applications. Although many benchmarks exist for evaluating the performance of text-to-SQL models, they often rely on a single aggregate score, lack evaluation under realistic settings, and provide limited insight into model behaviour across different query types. In this work, we present SQLyzr, a comprehensive benchmark and evaluation platform for text-to-SQL models. SQLyzr incorporates a diverse set of evaluation metrics that capture multiple aspects of generated queries, while enabling more realistic evaluation through workload alignment with real-world SQL usage patterns and database scaling. It further supports fine-grained query classification, error analysis, and workload augmentation, allowing users to better diagnose and improve text-to-SQL models. This demonstration showcases these capabilities through an interactive experience. Through SQLyzr's graphical interface, users can customize evaluation settings, analyze fine-grained reports, and explore additional features of the platform. We expect SQLyzr to facilitate the evaluation and iterative improvement of text-to-SQL models by addressing key limitations of existing benchmarks. The source code of SQLyzr is available at https://github.com/sepideh-abedini/SQLyzr.