Quantifying the semantic similarity between database queries is a critical challenge with broad applications, ranging from query log analysis to the automated assessment of SQL skills in education. Traditional methods often rely solely on syntactic comparison or are limited to checking semantic equivalence. This paper introduces a novel graph-based approach to measuring the semantic dissimilarity between SQL queries. Queries are represented as nodes in an implicit graph, and the transitions between nodes, called edits, are weighted by semantic dissimilarity. We employ shortest-path algorithms to identify the lowest-cost edit sequence between two given queries; the total cost of that sequence defines a quantifiable measure of semantic distance. A prototype implementation of this technique has been evaluated in an empirical study, which strongly suggests that our method provides more accurate and comprehensible grading than existing techniques. Moreover, the results indicate that our approach comes close to the quality of manual grading, making it a robust tool for diverse database query comparison tasks.
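The core idea of the abstract, a shortest-path search over an implicit graph whose nodes are queries and whose edges are weighted edits, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the prototype: the toy query model (a set of selected columns plus a set of predicates), the edit set in `toy_neighbors`, the weights `COLUMN_COST` and `PREDICATE_COST`, and the function name `semantic_distance`. Dijkstra's algorithm stands in for whichever shortest-path algorithm the implementation actually uses.

```python
import heapq
from itertools import count

# Hypothetical toy universe; a real system would derive edits from the SQL grammar and schema.
COLUMNS = ("name", "age", "city")
PREDICATES = ("age > 30", "city = 'Oslo'")
COLUMN_COST = 1.0      # assumed weight for adding or dropping a selected column
PREDICATE_COST = 2.0   # assumed weight for adding or dropping a WHERE predicate


def toy_neighbors(query):
    """Yield (next_query, edit_description, cost) for every single edit of `query`.

    The graph is implicit: neighbors are generated on demand, never stored.
    """
    cols, preds = query
    for c in COLUMNS:
        if c in cols:
            if len(cols) > 1:  # keep at least one selected column
                yield (cols - {c}, preds), f"drop column {c}", COLUMN_COST
        else:
            yield (cols | {c}, preds), f"add column {c}", COLUMN_COST
    for p in PREDICATES:
        if p in preds:
            yield (cols, preds - {p}), f"drop predicate {p}", PREDICATE_COST
        else:
            yield (cols, preds | {p}), f"add predicate {p}", PREDICATE_COST


def semantic_distance(source, target, neighbors):
    """Dijkstra over an implicit graph.

    Returns (total_cost, list_of_edits) for the cheapest edit sequence,
    or None if the target cannot be reached.
    """
    tie = count()  # tie-breaker so the heap never has to compare queries
    frontier = [(0.0, next(tie), source, ())]
    settled = set()
    while frontier:
        cost, _, node, path = heapq.heappop(frontier)
        if node == target:
            return cost, list(path)
        if node in settled:
            continue
        settled.add(node)
        for nxt, edit, weight in neighbors(node):
            if nxt not in settled:
                heapq.heappush(frontier, (cost + weight, next(tie), nxt, path + (edit,)))
    return None


# SELECT name FROM t   versus   SELECT name, age FROM t WHERE age > 30
source = (frozenset({"name"}), frozenset())
target = (frozenset({"name", "age"}), frozenset({"age > 30"}))
cost, edits = semantic_distance(source, target, toy_neighbors)
print(cost, edits)
```

In this toy setting the cheapest path needs one column edit and one predicate edit, so the distance is the sum of the two assumed weights. The returned edit list is what would make a grade explainable: each step names a concrete difference between the two queries.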