Large Language Model (LLM) benchmarks tell us when models fail, but not why. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling these causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model's unique "failure signature", clarifies what benchmarks actually measure, and broadens error identification to reduce blind spots. This helps developers debug models, align benchmark goals with outcomes, and make informed model selections. ErrorMap applies the same logic to any model or dataset. Applying our method to 35 datasets and 83 models, we generate ErrorAtlas, a taxonomy of model errors that reveals recurring failure patterns. ErrorAtlas highlights error types that are currently underexplored in LLM research, such as omission of required details from the output and misinterpretation of the question. By shifting focus from where models succeed to why they fail, ErrorMap and ErrorAtlas enable a more advanced form of evaluation, one that exposes hidden weaknesses and directs progress. Unlike success, which is typically measured with task-level metrics, our approach introduces a deeper evaluation layer that can be applied globally across models and tasks, offering richer insight into model behavior and limitations. We make the taxonomy and code publicly available and plan to periodically update ErrorAtlas as new benchmarks and models emerge.