Synthesizing high-quality reasoning data for continual training has proven effective in enhancing the performance of Large Language Models (LLMs). However, previous synthesis approaches struggle to scale up data easily and incur high costs in pursuit of high quality. In this paper, we propose the Graph-based Synthetic Data Pipeline (GSDP), an economical and scalable framework for high-quality reasoning data synthesis. Inspired by knowledge graphs, we extract knowledge points from seed data and construct a knowledge-point relationship graph to explore their interconnections. By exploiting the implicit relationships among knowledge points, our method achieves 255$\times$ data expansion. Furthermore, GSDP, driven by open-source models, achieves synthesis quality comparable to GPT-4-0613 at 100$\times$ lower cost. To tackle mathematical reasoning, one of the most challenging tasks, we present the GSDP-MATH dataset, comprising over 1.91 million pairs of math problems and answers. After fine-tuning on GSDP-MATH, GSDP-7B, based on Mistral-7B, achieves 37.7% accuracy on MATH and 78.4% on GSM8K, demonstrating the effectiveness of our method. The dataset and models trained in this paper will be made publicly available.
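To make the graph-based expansion concrete, the following is a minimal sketch of the idea described above, assuming co-occurrence within a seed problem as the edge criterion. The function names `extract_knowledge_points` and `generate_problem` are illustrative placeholders for an LLM-based tagger and an open-source generator model, not the paper's actual implementation.

```python
# Minimal sketch of graph-based data expansion (illustrative, not the
# paper's exact pipeline). extract_knowledge_points and generate_problem
# are assumed callables: a knowledge-point tagger and an LLM generator.
from itertools import combinations
from collections import defaultdict


def build_knowledge_graph(seed_problems, extract_knowledge_points):
    """Nodes are knowledge points; an edge links two points that
    co-occur in at least one seed problem."""
    graph = defaultdict(set)
    for problem in seed_problems:
        points = set(extract_knowledge_points(problem))
        for a, b in combinations(points, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def expand(graph, generate_problem, max_pairs=None):
    """Enumerate knowledge-point pairs connected in the graph and ask
    the generator model for a new problem combining each pair."""
    synthesized, seen = [], set()
    for a in graph:
        for b in graph[a]:
            key = tuple(sorted((a, b)))
            if key in seen:
                continue
            seen.add(key)
            synthesized.append(generate_problem(a, b))
            if max_pairs and len(synthesized) >= max_pairs:
                return synthesized
    return synthesized
```

Since pairing $n$ knowledge points admits up to $n(n-1)/2$ candidate combinations, even a modest seed set can in principle support an expansion on the order of the 255$\times$ reported above.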