Current evaluations of large language models (LLMs) rely heavily on a growing collection of benchmarks and on aggregate benchmark scores, yet it remains unclear what these comparisons actually capture and what the scores reveal about models' underlying capabilities. Here, we propose a new paradigm for LLM evaluation that asks whether benchmark performance reflects many independent abilities or instead rests on a small number of shared dimensions. To answer this question, we apply Factor Analysis (FA) to a large LLM-by-benchmark performance matrix \((60\times44)\) and find that it has an \emph{intrinsically low-rank} structure: a small number of latent factors captures most of the structure in the full task space. This low-rank geometry reveals substantial redundancy across existing tasks and explains why many benchmarks appear to measure overlapping abilities. We further show that these latent factors correspond to coherent, skill-like dimensions of LLM behavior. Leveraging this latent skill space, we deliver three practical tools for LLM evaluation and for downstream users: (i)~identifying redundant tasks, (ii)~profiling new models using a small subset of tasks, and (iii)~selecting models aligned with desired skill profiles. Our method provides an alternative to the de facto standard of a single aggregate score and establishes an interpretable, practical framework for understanding and benchmarking core LLM capabilities.
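To make the low-rank claim concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a synthetic \(60\times44\) score matrix generated from three hypothetical latent skills plus noise, and it uses a PCA-style eigendecomposition of the task correlation matrix as a simplified stand-in for Factor Analysis. All function names, the number of skills, and the noise level are illustrative assumptions.

```python
import math
import random


def synthetic_scores(n_models=60, n_tasks=44, n_skills=3, noise=0.1, seed=0):
    """Hypothetical data: each score is a skill-weighted sum plus Gaussian noise."""
    rng = random.Random(seed)
    skills = [[rng.gauss(0, 1) for _ in range(n_skills)] for _ in range(n_models)]
    loadings = [[rng.gauss(0, 1) for _ in range(n_tasks)] for _ in range(n_skills)]
    return [
        [
            sum(skills[m][k] * loadings[k][t] for k in range(n_skills))
            + rng.gauss(0, noise)
            for t in range(n_tasks)
        ]
        for m in range(n_models)
    ]


def correlation_matrix(X):
    """Task-by-task correlation matrix of a models-by-tasks score matrix."""
    n = len(X)
    Z = []
    for col in zip(*X):  # iterate over tasks (columns)
        mu = sum(col) / n
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / n)
        Z.append([(v - mu) / sd for v in col])  # standardize each task
    p = len(Z)
    return [
        [sum(a * b for a, b in zip(Z[i], Z[j])) / n for j in range(p)]
        for i in range(p)
    ]


def top_eigenvalues(C, k, iters=300, seed=1):
    """Leading k eigenvalues of a symmetric matrix via power iteration with deflation."""
    p = len(C)
    A = [row[:] for row in C]  # work on a copy; deflation modifies it
    rng = random.Random(seed)
    values = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(p)]
        for _ in range(iters):
            w = [sum(A[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient v^T A v estimates the eigenvalue for unit-norm v.
        lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(p)) for i in range(p))
        values.append(lam)
        for i in range(p):  # remove the component just found
            for j in range(p):
                A[i][j] -= lam * v[i] * v[j]
    return values


def explained_shares(X, k=5):
    """Fraction of total standardized variance carried by each leading component."""
    C = correlation_matrix(X)
    p = len(C)  # the trace of a correlation matrix equals the number of tasks
    return [lam / p for lam in top_eigenvalues(C, k)]


if __name__ == "__main__":
    scores = synthetic_scores()
    for rank, share in enumerate(explained_shares(scores), start=1):
        print(f"component {rank}: {share:.3f} of total variance")
```

On data built this way, the first few components should account for nearly all of the variance, which is the pattern the low-rank claim describes. A proper factor-analysis fit would additionally estimate per-task noise terms, which this sketch omits.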