Property Graphs are rapidly being adopted as database frameworks for representing heterogeneous data sources. To enable precise access to the information they contain, we need conversational interfaces based on Text-To-Cypher (Text2Cypher) parsers. This paper presents an automatic synthetic data generation method that can be leveraged to fine-tune small LLMs for this task. We conduct experiments on all the major Text-To-Cypher benchmarks, demonstrating that our synthetic data generation approach significantly increases the performance of small LLMs, allowing them to compete with much larger proprietary models. This means that in settings where models must be deployed locally, we can ensure data sovereignty without sacrificing accuracy and without costly annotation campaigns.