The emergence of Large Language Models (LLMs) has revolutionized many fields beyond traditional natural language processing (NLP) tasks. Research on applying LLMs to databases has recently flourished, and graph databases, as a typical class of non-relational databases, have naturally attracted significant attention in this line of work. In particular, recent efforts have increasingly focused on leveraging LLMs to translate natural language into graph query language (NL2GQL). Although some progress has been made, existing methods have clear limitations: they rely on streamlined pipelines that overlook the potential of LLMs to autonomously plan and to collaborate with other LLMs in tackling complex NL2GQL challenges. To address this gap, we propose NAT-NL2GQL, a novel multi-agent framework for translating natural language into graph query language. Specifically, our framework consists of three synergistic agents: the Preprocessor agent, the Generator agent, and the Refiner agent. The Preprocessor agent manages data processing to build context, including named entity recognition, query rewriting, path linking, and the extraction of query-related schemas. The Generator agent, a fine-tuned LLM trained on NL-GQL data, generates the corresponding GQL statements from queries and their related schemas. The Refiner agent refines the GQL or the context using error information obtained from GQL execution results. Given the scarcity of high-quality open-source NL2GQL datasets based on nGQL syntax, we constructed StockGQL, a dataset built from a financial-market graph database. It is available at: https://github.com/leonyuancode/StockGQL. Experimental results on the StockGQL and SpCQL datasets show that our method significantly outperforms baseline approaches, highlighting its potential for advancing NL2GQL research.
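The three-agent interaction described above can be summarized as a conceptual sketch. This is not the paper's implementation: every function body below is a hypothetical stand-in (the real Preprocessor, Generator, and Refiner are LLM-backed), and all names are illustrative; only the control flow — preprocess, generate, then iteratively execute and refine on errors — reflects the framework as described.

```python
# Conceptual sketch of the NAT-NL2GQL agent loop. All bodies are
# hypothetical stubs; in the framework each agent is backed by an LLM.

def preprocessor(question: str, schema: dict) -> dict:
    """Builds context: named entity recognition, query rewriting,
    path linking, and extraction of query-related schemas."""
    # Stub: pass the question and schema through unchanged.
    return {"query": question, "schema": schema}

def generator(context: dict) -> str:
    """Stand-in for the fine-tuned LLM that emits a candidate GQL
    statement from the query and its related schema."""
    return f"MATCH (v) RETURN v LIMIT 10  // for: {context['query']}"

def execute_gql(gql: str) -> dict:
    """Stand-in for running the GQL against the graph database."""
    return {"ok": True, "error": None}

def refiner(gql: str, context: dict, error: str) -> str:
    """Repairs the GQL (or the context) using execution-error feedback."""
    return gql  # no-op in this sketch

def nat_nl2gql(question: str, schema: dict, max_rounds: int = 3) -> str:
    """Full pipeline: preprocess, generate, then execute-and-refine."""
    context = preprocessor(question, schema)
    gql = generator(context)
    for _ in range(max_rounds):
        result = execute_gql(gql)
        if result["ok"]:
            return gql
        gql = refiner(gql, context, result["error"])
    return gql
```

The refinement loop is bounded (`max_rounds`) so that a query the Refiner cannot repair still terminates with the best candidate found.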