While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, which has motivated extensive work on synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as the data scales, leading to diminishing returns; and (2) while multi-stage prompting methods may outperform simple augmentation methods, their advantage can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.