As large language models (LLMs) become increasingly powerful, traditional evaluation metrics tend to saturate, making it challenging to distinguish between models based on their performance. We propose a general method to transform existing LLM evaluations into a series of progressively more difficult tasks. These enhanced evaluations emphasize reasoning capabilities and can reveal relative performance differences that are not apparent in the original assessments. To demonstrate the effectiveness of our approach, we create a new multiple-choice test corpus, extend it into a family of evaluations, and assess a collection of LLMs. Our results offer insights into the comparative reasoning abilities of these models, particularly highlighting distinctions between OpenAI's o1-preview and Google's gemini-pro-1.5-002.
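To make the idea of "progressively more difficult tasks" concrete, below is a minimal sketch of one possible construction (not necessarily the one used in this work): bundle k items from an existing multiple-choice benchmark into a single composite task that is scored all-or-nothing, so difficulty rises with k. The names `Item`, `compose_level_k`, and `accuracy_at_level` are illustrative assumptions, not the paper's API.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Item:
    """One multiple-choice question from an existing evaluation."""
    question: str
    choices: List[str]
    answer_index: int

def compose_level_k(items: List[Item], k: int, n_tasks: int, seed: int = 0) -> List[Tuple[Item, ...]]:
    """Sample n_tasks composite tasks, each bundling k distinct source items.

    Larger k yields a harder evaluation level, since a composite task is
    scored correct only when every sub-question is answered correctly.
    """
    rng = random.Random(seed)
    return [tuple(rng.sample(items, k)) for _ in range(n_tasks)]

def score_composite(task: Tuple[Item, ...], model_answer: Callable[[Item], int]) -> bool:
    """All-or-nothing scoring: correct only if all k sub-answers are correct."""
    return all(model_answer(item) == item.answer_index for item in task)

def accuracy_at_level(items: List[Item], model_answer: Callable[[Item], int],
                      k: int, n_tasks: int = 200) -> float:
    """Estimate a model's accuracy on the level-k version of the benchmark."""
    tasks = compose_level_k(items, k, n_tasks)
    return sum(score_composite(t, model_answer) for t in tasks) / len(tasks)
```

Under this illustrative construction, composite accuracy decays roughly as p^k for per-item accuracy p, so two models that both look saturated on the original benchmark (say 0.97 vs. 0.99 per item) separate clearly at higher levels (about 0.74 vs. 0.90 at k = 10).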