Synthetic data generation has become an increasingly popular way to train models without large, manually labeled datasets. For tasks like text embedding, synthetic data offers diverse, scalable training examples, significantly reducing the cost of human annotation. However, most current approaches rely heavily on proprietary models such as GPT-4, which are expensive and inefficient for generating large-scale embedding data. In this paper, we introduce SPEED, a framework that aligns open-source small models (8B) to efficiently generate large-scale synthetic embedding data. Through supervised fine-tuning, preference optimization, and self-improvement, SPEED enables small open-source models to produce high-quality data. Remarkably, SPEED uses fewer than 1/10 of the GPT API calls yet outperforms the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data. Using this efficient generator, we conduct a comprehensive study of how various factors in the alignment pipeline affect data quality, and we reveal a scaling law for synthetic embedding data.