In this paper, we propose a novel graph-based methodology for evaluating the functional correctness of generated SQL. Conventional metrics for SQL code generation fall into matching-based methods (e.g., exact set match) and execution-based methods (e.g., execution accuracy), and each has a primary limitation. Matching-based metrics cannot effectively assess functional correctness, because syntactically different SQL queries may be functionally identical. Execution-based metrics are prone to false positives: two non-equivalent queries can return the same result when the test data does not exercise their differences. Our proposed evaluation method, \texttt{FuncEvalGMN}, does not depend on extensively prepared test data and enables precise testing of the functional correctness of the code. First, we parse SQL into a relational operator tree (ROT) called \textit{Relnode}, which contains rich semantic information from the perspective of logical execution. Then, we introduce a GNN-based approach for predicting the functional correctness of generated SQL. This approach incorporates global positional embeddings to address the loss of topological information in conventional graph matching frameworks. As an auxiliary contribution, we propose a rule-based matching algorithm, Relnode Partial Matching (\texttt{RelPM}), as a baseline. Finally, we contribute a dataset, \texttt{Pair-Aug-Spider}, with one training set and two test sets, each comprising pairs of SQL queries that simulate various SQL code evaluation scenarios. The training set and one test set focus on code generation using large language models (LLMs), while the other test set emphasizes SQL equivalence rewriting.
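The two limitations of conventional metrics can be illustrated with a minimal sketch. It is not part of \texttt{FuncEvalGMN}; the `singer` table, its rows, and the three queries are hypothetical, and a plain string comparison stands in for matching-based metrics as a simplification of exact set match. The sketch uses Python's standard `sqlite3` module to show that a matching-based check rejects an equivalent rewrite, while an execution-based check accepts a non-equivalent query when the test data is too sparse.

```python
import sqlite3


def make_db(rows):
    """Build an in-memory database with a hypothetical `singer` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO singer VALUES (?, ?)", rows)
    return conn


def run(conn, sql):
    """Execute a query and return its rows sorted, so row order is ignored."""
    return sorted(conn.execute(sql).fetchall())


# Reference query and an equivalent rewrite with a different surface form.
q_ref = "SELECT name FROM singer WHERE age > 30"
q_equiv = "SELECT name FROM singer WHERE NOT age <= 30"
# A query that is NOT equivalent: it also returns singers aged exactly 30.
q_wrong = "SELECT name FROM singer WHERE age >= 30"

# Sparse test data has no row with age == 30; dense test data does.
sparse = make_db([("Ann", 25), ("Bob", 40)])
dense = make_db([("Ann", 25), ("Bob", 40), ("Cy", 30)])

# Limitation 1: string matching rejects a functionally equivalent query.
string_match = q_ref == q_equiv

# Limitation 2: execution matching accepts a wrong query on sparse data.
exec_match_sparse = run(sparse, q_ref) == run(sparse, q_wrong)

# With a boundary row present, execution matching tells them apart,
# while the equivalent rewrite still agrees with the reference.
exec_match_dense = run(dense, q_ref) == run(dense, q_wrong)
equiv_dense = run(dense, q_ref) == run(dense, q_equiv)

print("string match (equivalent pair):", string_match)
print("execution match on sparse data (wrong pair):", exec_match_sparse)
print("execution match on dense data (wrong pair):", exec_match_dense)
print("execution match on dense data (equivalent pair):", equiv_dense)
```

The execution-based verdict here depends entirely on whether the test database happens to contain the boundary row, which is the dependence on test-data preparation that a structural comparison of the queries is meant to avoid.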