Despite strong performance on Text-to-SQL benchmarks, it remains unclear whether SQL programs generated by large language models (LLMs) are structurally reliable. We investigate the structural behavior of LLM-generated SQL queries and introduce SQLStructEval, a framework that analyzes program structure through canonical abstract syntax tree (AST) representations. Our experiments on the Spider benchmark show that modern LLMs often produce structurally diverse queries for the same input, even when the execution results are correct, and that this variance is frequently triggered by surface-level input changes such as paraphrases or differences in schema presentation. We further show that generating queries in a structured space via a compile-style pipeline can improve both execution accuracy and structural consistency. These findings suggest that structural reliability is a critical yet overlooked dimension for evaluating LLM-based program generation systems. Our code is available at https://anonymous.4open.science/r/StructEval-2435.
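To make the idea of comparing queries through canonical ASTs concrete, the following is a minimal sketch and not the SQLStructEval implementation. It assumes queries have already been parsed into a toy AST of `(label, [children])` tuples; the node labels, the `canonicalize` function, and the `structural_variance` measure are all hypothetical names introduced here for illustration. Canonicalization sorts the operands of commutative operators and normalizes identifier case, so that queries differing only in such surface details map to the same structure.

```python
from collections import Counter

# Assumption: a toy AST where a node is (label, [children]) and a leaf is a string.
# Only AND/OR are treated as commutative in this sketch.
COMMUTATIVE = {"AND", "OR"}


def canonicalize(node):
    """Return a hashable canonical form of a toy SQL AST node."""
    if isinstance(node, str):
        # Keep quoted string literals as-is; lowercase identifiers and keywords.
        return node if node.startswith("'") else node.lower()
    label, children = node
    canon = [canonicalize(child) for child in children]
    if label in COMMUTATIVE:
        # Operand order is irrelevant for commutative operators.
        canon.sort(key=repr)
    return (label, tuple(canon))


def structural_variance(asts):
    """Fraction of samples that deviate from the most common canonical structure
    (a hypothetical measure, not a metric defined by the paper)."""
    counts = Counter(canonicalize(ast) for ast in asts)
    top_count = counts.most_common(1)[0][1]
    return 1.0 - top_count / len(asts)


# Three sampled queries for the same question, written as toy ASTs.
q1 = ("SELECT", [("COLS", ["name"]), ("FROM", ["singer"]),
                 ("WHERE", [("AND", [("GT", ["age", "30"]),
                                     ("EQ", ["country", "'France'"])])])])
# Same structure as q1: swapped AND operands and different identifier case.
q2 = ("SELECT", [("COLS", ["Name"]), ("FROM", ["Singer"]),
                 ("WHERE", [("AND", [("EQ", ["Country", "'France'"]),
                                     ("GT", ["Age", "30"])])])])
# Equivalent in execution, but structurally different (NOT ... <= instead of >).
q3 = ("SELECT", [("COLS", ["name"]), ("FROM", ["singer"]),
                 ("WHERE", [("AND", [("EQ", ["country", "'France'"]),
                                     ("NOT", [("LE", ["age", "30"])])])])])

same_structure = canonicalize(q1) == canonicalize(q2)
variance = structural_variance([q1, q2, q3])
```

Here `q1` and `q2` collapse to one canonical structure while `q3` does not, so one of the three samples deviates from the majority structure even though all three could return the same rows. A real pipeline would obtain the ASTs from an SQL parser rather than hand-written tuples.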