We introduce SQLSpace, a human-interpretable, generalizable, compact representation for text-to-SQL examples derived with minimal human intervention. We demonstrate the utility of these representations in evaluation with three use cases: (i) closely comparing and contrasting the composition of popular text-to-SQL benchmarks to identify unique dimensions of examples they evaluate, (ii) understanding model performance at a granular level beyond overall accuracy scores, and (iii) improving model performance through targeted query rewriting based on learned correctness estimation. We show that SQLSpace enables analysis that would be difficult with raw examples alone: it reveals compositional differences between benchmarks, exposes performance patterns obscured by accuracy alone, and supports modeling of query success.