With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study of OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. The community is eager to know how GPT-3 progressively improved into GPT-4, including technical details such as whether adding code data improves an LLM's reasoning capability, which aspects of LLM capability can be improved by supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), and how large the alignment tax is. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.