Data is growing rapidly in volume and complexity, and proficiency in database query languages is pivotal for crafting effective queries. As coding assistants become more prevalent, there is a significant opportunity to enhance database query languages. The Kusto Query Language (KQL) is a widely used query language for large semi-structured data, such as logs, telemetry, and time series, on big data analytics platforms. This paper introduces NL2KQL, an innovative framework that uses large language models (LLMs) to convert natural language queries (NLQs) into KQL queries. The NL2KQL framework includes several key components: the Schema Refiner, which narrows the schema down to its most pertinent elements; the Few-shot Selector, which dynamically selects relevant examples from a few-shot dataset; and the Query Refiner, which repairs syntactic and semantic errors in KQL queries. This study also outlines a method for generating large datasets of synthetic NLQ-KQL pairs that are valid within a specific database context. To validate NL2KQL's performance, we use an array of online (based on query execution) and offline (based on query parsing) metrics. Ablation studies examine the significance of each framework component, and the datasets used for benchmarking are made publicly available. To our knowledge, this work is the first of its kind; we compare it with available baselines to demonstrate its effectiveness.
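To make the division of labor among the three components concrete, the following is a minimal Python sketch of how such a pipeline could be wired together. It is not the paper's implementation. Everything in it is an assumption made for illustration: token overlap stands in for whatever relevance measure the Schema Refiner and Few-shot Selector actually use, the repair step only fixes a misspelled source table rather than general syntactic and semantic errors, the LLM is passed in as a plain callable, and the table and column names are hypothetical.

```python
import difflib
import re


def tokens(text):
    """Split text into lowercase word tokens, breaking camelCase identifiers."""
    return {t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", text)}


def overlap(a, b):
    """Jaccard overlap of token sets (assumed stand-in for a real relevance score)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def refine_schema(nlq, schema, top_k=1):
    """Schema Refiner (toy): keep the top_k tables most similar to the NLQ."""
    def score(table):
        return overlap(nlq, table + " " + " ".join(schema[table]))
    kept = sorted(schema, key=score, reverse=True)[:top_k]
    return {table: schema[table] for table in kept}


def select_few_shots(nlq, examples, top_k=1):
    """Few-shot Selector (toy): pick the (NLQ, KQL) pairs closest to the NLQ."""
    return sorted(examples, key=lambda ex: overlap(nlq, ex[0]), reverse=True)[:top_k]


def build_prompt(nlq, schema, shots):
    """Assemble a prompt from the refined schema and the selected examples."""
    lines = ["Schema:"]
    lines += [f"  {table}({', '.join(cols)})" for table, cols in schema.items()]
    for question, kql in shots:
        lines += [f"Q: {question}", f"KQL: {kql}"]
    lines += [f"Q: {nlq}", "KQL:"]
    return "\n".join(lines)


def refine_query(kql, schema):
    """Query Refiner (toy): replace an unknown source table with the closest known one."""
    head, sep, rest = kql.partition("|")
    table = head.strip()
    if table in schema:
        return kql
    match = difflib.get_close_matches(table, list(schema), n=1)
    if not match:
        return kql  # nothing plausible to repair with; leave the query unchanged
    return match[0] + (" |" + rest if sep else "")


def nl2kql(nlq, schema, examples, llm, top_k=1):
    """Run the three stages around a caller-supplied LLM callable (prompt -> str)."""
    refined = refine_schema(nlq, schema, top_k)
    shots = select_few_shots(nlq, examples, top_k)
    draft = llm(build_prompt(nlq, refined, shots))
    return refine_query(draft, refined)


# Hypothetical schema and few-shot data, for illustration only.
schema = {
    "SigninLogs": ["UserPrincipalName", "ResultType", "TimeGenerated"],
    "DeviceEvents": ["DeviceName", "ActionType"],
    "EmailEvents": ["Subject", "SenderFromAddress"],
}
examples = [
    ("count failed signin attempts per user",
     "SigninLogs | where ResultType != 0 | summarize count() by UserPrincipalName"),
    ("list devices with blocked actions",
     "DeviceEvents | where ActionType == 'Blocked' | distinct DeviceName"),
]


def fake_llm(prompt):
    """Stub model that returns a draft with a misspelled table name."""
    return "SigninLog | where ResultType != 0 | summarize count() by UserPrincipalName"


result = nl2kql("count failed signin logs per user", schema, examples, fake_llm)
print(result)
```

In this sketch the stub model's draft references a table that does not exist in the refined schema, and the repair step swaps in the closest known table name. A realistic Query Refiner would need a KQL parser and schema-aware checks, which this example deliberately leaves out.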