Data is growing rapidly in volume and complexity, and proficiency in database query languages is pivotal for crafting effective queries. As coding assistants become more prevalent, there is a significant opportunity to enhance database query languages. The Kusto Query Language (KQL) is a widely used query language for large semi-structured data, such as logs, telemetry, and time series, on big data analytics platforms. This paper introduces NL2KQL, an innovative framework that uses large language models (LLMs) to convert natural language queries (NLQs) to KQL queries. The proposed NL2KQL framework includes several key components: the Schema Refiner, which narrows the schema down to its most pertinent elements; the Few-shot Selector, which dynamically selects relevant examples from a few-shot dataset; and the Query Refiner, which repairs syntactic and semantic errors in KQL queries. Additionally, this study outlines a method for generating large datasets of synthetic NLQ-KQL pairs that are valid within a specific database context. To validate NL2KQL's performance, we use an array of online (based on query execution) and offline (based on query parsing) metrics. Ablation studies examine the significance of each framework component, and the datasets used for benchmarking are made publicly available. This work is the first of its kind, and we compare it with available baselines to demonstrate its effectiveness.
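To make the roles of the Schema Refiner and the Few-shot Selector concrete, the following is a minimal sketch and not the implementation described in the paper. It assumes that both components can be approximated by ranking candidates with token-overlap (Jaccard) similarity against the NLQ; the paper's actual ranking method is not stated in this abstract. The schema, the example NLQ-KQL pairs, and all function names (`refine_schema`, `select_few_shots`, `build_prompt`) are hypothetical and chosen only for illustration. The LLM call and the Query Refiner step are omitted.

```python
import re

# Hypothetical schema (table -> columns); not taken from the paper.
SCHEMA = {
    "SigninLogs": ["TimeGenerated", "UserName", "ResultType"],
    "NetworkEvents": ["TimeGenerated", "SourceIp", "DestinationPort"],
    "ProcessEvents": ["TimeGenerated", "ProcessName", "CommandLine"],
}

# Hypothetical few-shot dataset of (NLQ, KQL) pairs; not taken from the paper.
EXAMPLES = [
    ("count failed logins per user",
     "SigninLogs | where ResultType != 0 | summarize count() by UserName"),
    ("list network connections to port 443",
     "NetworkEvents | where DestinationPort == 443"),
    ("show processes started in the last hour",
     "ProcessEvents | where TimeGenerated > ago(1h)"),
]


def tokens(text):
    """Split text into lowercase word tokens, also breaking CamelCase names."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", text)
    return {p.lower() for p in parts}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def refine_schema(nlq, schema, top_k=1):
    """Assumed stand-in for the Schema Refiner: keep the top_k tables whose
    table and column names overlap most with the NLQ."""
    q = tokens(nlq)
    ranked = sorted(
        schema.items(),
        key=lambda item: jaccard(q, tokens(item[0] + " " + " ".join(item[1]))),
        reverse=True,
    )
    return dict(ranked[:top_k])


def select_few_shots(nlq, examples, top_k=2):
    """Assumed stand-in for the Few-shot Selector: keep the top_k examples
    whose NLQ is most similar to the input NLQ."""
    q = tokens(nlq)
    return sorted(examples, key=lambda ex: jaccard(q, tokens(ex[0])), reverse=True)[:top_k]


def build_prompt(nlq, schema, shots):
    """Assemble the refined schema and selected examples into an LLM prompt."""
    lines = ["Tables:"]
    lines += [f"  {table}({', '.join(cols)})" for table, cols in schema.items()]
    lines.append("Examples:")
    for example_nlq, example_kql in shots:
        lines.append(f"  NLQ: {example_nlq}")
        lines.append(f"  KQL: {example_kql}")
    lines.append(f"NLQ: {nlq}")
    lines.append("KQL:")
    return "\n".join(lines)


nlq = "count failed logins per user name in the last day"
refined = refine_schema(nlq, SCHEMA, top_k=1)
shots = select_few_shots(nlq, EXAMPLES, top_k=2)
prompt = build_prompt(nlq, refined, shots)
print(prompt)
```

In this sketch, narrowing the schema and the examples before prompting keeps the prompt short and focused on the tables the NLQ is likely to need. A production system would likely replace the token-overlap score with a stronger similarity measure, and would pass the generated KQL through a repair step such as the Query Refiner.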