With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study on OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. The community is currently eager to know how GPT-3 progressively improved into GPT-4, including technical details such as whether adding code data improves an LLM's reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, and how large the alignment tax is. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.