Dialect-Agnostic SQL Parsing via LLM-Based Segmentation

SQL is a widely adopted language for querying data, which has led to the development of various SQL analysis and rewriting tools. However, because SQL dialects are so diverse, such tools often fail when they encounter dialect-specific syntax that they do not recognize. While Large Language Models (LLMs) have shown promise in understanding SQL queries, their difficulty with hierarchical structures and their risk of hallucination limit their direct applicability to parsing. To address these limitations, we propose SQLFlex, a novel query rewriting framework that integrates grammar-based parsing with LLM-based segmentation to parse diverse SQL dialects robustly. Our core idea is to decompose hierarchical parsing into sequential segmentation tasks, which better aligns with the strengths of LLMs and improves output reliability through validation checks. Specifically, SQLFlex uses two strategies, clause-level segmentation and expression-level segmentation, which decompose elements at different levels of a query. We extensively evaluated SQLFlex both in real-world use cases and in a standalone evaluation. In SQL linting, SQLFlex outperforms SQLFluff in ANSI mode by 63.68% in F1 score while matching the performance of its dialect-specific mode. In test-case reduction, SQLFlex outperforms SQLess by up to 10 times in simplification rate. In the standalone evaluation, it parses 91.55% to 100% of queries across eight distinct dialects, outperforming all baseline parsers. We believe SQLFlex can serve as a foundation for many query analysis and rewriting use cases.
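To make the idea of validating a sequential segmentation concrete, the following is a minimal sketch of one possible check, not SQLFlex's actual implementation. It assumes that a clause-level segmentation arrives as a flat list of strings (for example, as proposed by an LLM) and accepts it only if the segments reproduce the original query and each one begins with a clause keyword. The names `CLAUSE_KEYWORDS`, `normalize`, and `validate_segmentation` are hypothetical, and the keyword list is illustrative rather than complete.

```python
import re

# Hypothetical, deliberately incomplete keyword list used only for illustration.
CLAUSE_KEYWORDS = ("SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY", "LIMIT")


def normalize(text: str) -> str:
    """Collapse runs of whitespace so comparisons ignore formatting."""
    return re.sub(r"\s+", " ", text).strip()


def validate_segmentation(query: str, segments: list[str]) -> bool:
    """Accept a proposed segmentation only if it drops and invents nothing.

    Assumption: adjacent clauses are separated by whitespace in the query.
    """
    # Reject empty proposals and blank segments.
    if not segments or any(not normalize(s) for s in segments):
        return False
    # The segments, joined in order, must reproduce the original query.
    if normalize(" ".join(segments)) != normalize(query):
        return False
    # Every segment must begin with a known clause keyword.
    return all(normalize(s).upper().startswith(CLAUSE_KEYWORDS) for s in segments)


query = "SELECT a, b FROM t WHERE a > 1 LIMIT 5 OFFSET 2"
good = ["SELECT a, b", "FROM t", "WHERE a > 1", "LIMIT 5 OFFSET 2"]
altered = ["SELECT a, b", "FROM t", "WHERE a > 2", "LIMIT 5 OFFSET 2"]
```

In this sketch `validate_segmentation(query, good)` is accepted, while `altered`, which changes a literal, is rejected because the rejoined text no longer matches the query. A check of this kind is cheap because a flat segmentation can be compared directly against the input text.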
