Cypher, the query language for Neo4j graph databases, plays a critical role in enabling graph-based analytics and data exploration. While substantial research has been dedicated to natural-language-to-SQL query generation (Text2SQL), the analogous problem for graph databases, referred to as Text2Cypher, remains underexplored. In this work, we introduce SynthCypher, a fully synthetic and automated data generation pipeline designed to address this gap. SynthCypher employs a novel LLM-Supervised Generation-Verification framework, ensuring syntactically and semantically correct Cypher queries across diverse domains and query complexities. Using this pipeline, we create the SynthCypher Dataset, a large-scale benchmark containing 29.8k Text2Cypher instances. Fine-tuning open-source large language models (LLMs), including LLaMa-3.1-8B, Mistral-7B, and QWEN-7B, on SynthCypher yields significant performance improvements of up to 40% on the Text2Cypher test set and 30% on the SPIDER benchmark adapted for graph databases. This work demonstrates that high-quality synthetic data can effectively advance the state of the art in Text2Cypher tasks.