The ability to generate SQL queries from natural language has significant implications for making data accessible to non-specialists. This paper presents a novel approach to fine-tuning open-source large language models (LLMs) for translating natural language into SQL queries in the retail domain. We introduce models specialized in SQL generation, trained on synthetic datasets tailored to the Snowflake SQL and GoogleSQL dialects. Our methodology consists of generating a context-specific dataset with GPT-4, then fine-tuning three open-source LLMs (Starcoder Plus, Code-Llama, and Mistral) using the LoRA technique to keep training within resource constraints. In zero-shot settings, the fine-tuned models outperform the GPT-4 baseline, with Code-Llama achieving the highest accuracy: 81.58% for Snowflake SQL and 82.66% for GoogleSQL. These results underscore the effectiveness of fine-tuning LLMs on domain-specific tasks and suggest a promising direction for making relational databases more accessible through natural language interfaces.