Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Recent research has focused on enhancing the capabilities of smaller models through imitation learning, drawing on the outputs generated by large foundation models (LFMs). Several issues limit the quality of these models: limited imitation signals from shallow LFM outputs; small-scale, homogeneous training data; and, most notably, a lack of rigorous evaluation, which leads to overestimating the small models' capabilities because they tend to learn to imitate the style, but not the reasoning process, of LFMs. To address these challenges, we develop Orca, a 13-billion-parameter model that learns to imitate the reasoning process of LFMs. (We are working with our legal team to publicly release a diff of the model weights in accordance with LLaMA's release policy, to be published at https://aka.ms/orca-lm.) Orca learns from rich signals from GPT-4, including explanation traces, step-by-step thought processes, and other complex instructions, guided by teacher assistance from ChatGPT. To promote this progressive learning, we tap into large-scale and diverse imitation data with judicious sampling and selection. Orca surpasses conventional state-of-the-art instruction-tuned models such as Vicuna-13B by more than 100% on complex zero-shot reasoning benchmarks like Big-Bench Hard (BBH) and by 42% on AGIEval. Moreover, Orca reaches parity with ChatGPT on the BBH benchmark and shows competitive performance (a 4-point gap with an optimized system message) on professional and academic examinations such as the SAT, LSAT, GRE, and GMAT, both in zero-shot settings without chain-of-thought (CoT) prompting, while still trailing GPT-4. Our research indicates that learning from step-by-step explanations, whether generated by humans or by more advanced AI models, is a promising direction for improving model capabilities and skills.
