With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study of OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. The community is eager to know how GPT-3 progressively improved into GPT-4, including technical details such as whether adding code data improves an LLM's reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, and how large the alignment tax is. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.