We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.