Large language models (LLMs) have achieved remarkable performance on diverse benchmarks, yet existing evaluation practices largely rely on coarse summary metrics that obscure underlying reasoning abilities. In this work, we propose novel methodologies that adapt cognitive diagnosis models (CDMs) from psychometrics to LLM evaluation, enabling fine-grained diagnosis via multidimensional discrete capability profiles and interpretable characterizations of LLM strengths and weaknesses. First, to enable CDM-based evaluation at benchmark scale (more than 1,000 items), we propose a scalable method that jointly estimates LLM mastery profiles and the item-attribute Q-matrix, addressing key challenges posed by high-dimensional latent attributes (K > 20), large item pools, and the prohibitive computational cost of existing marginal maximum likelihood estimation. Second, we incorporate item-level textual information to construct text-embedding-informed priors for the Q-matrix, stabilizing high-dimensional estimation while reducing reliance on costly human specification. We develop an efficient stochastic-approximation algorithm that jointly estimates the LLM mastery profiles and the Q-matrix, balancing data fit with the text-embedding-informed priors. Simulation studies demonstrate accurate parameter recovery. An application to the MATH Level 5 benchmark illustrates the practical utility of our method for LLM evaluation and uncovers useful insights into LLMs' fine-grained capabilities.
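To make the core objects concrete, the following is a minimal illustrative sketch of how a binary Q-matrix and discrete mastery profiles determine response probabilities in a cognitive diagnosis model. The abstract does not specify which CDM is used, so this example assumes the classic DINA model for illustration; all names (`alpha`, `q_matrix`, `slip`, `guess`) are hypothetical and not from the paper.

```python
import numpy as np

def dina_prob(alpha, q_matrix, slip, guess):
    """Correct-response probabilities under the DINA model (illustrative only).

    alpha:    (n_llms, K) binary mastery profiles, one row per LLM
    q_matrix: (n_items, K) binary item-attribute requirements
    slip:     (n_items,) probability of slipping despite full mastery
    guess:    (n_items,) probability of guessing without full mastery
    """
    # eta[i, j] = 1 iff LLM i masters every attribute that item j requires
    eta = (alpha @ q_matrix.T) >= q_matrix.sum(axis=1)  # (n_llms, n_items)
    return np.where(eta, 1.0 - slip, guess)

# Toy example: 2 LLMs, K = 3 attributes, 2 items
alpha = np.array([[1, 1, 0],
                  [1, 0, 1]])
q = np.array([[1, 1, 0],   # item 1 requires attributes 1 and 2
              [0, 0, 1]])  # item 2 requires attribute 3
slip = np.array([0.1, 0.2])
guess = np.array([0.25, 0.25])

p = dina_prob(alpha, q, slip, guess)
# LLM 1 masters both attributes of item 1, so p[0, 0] = 1 - 0.1 = 0.9;
# it lacks attribute 3, so p[0, 1] falls back to the guessing rate 0.25.
```

The paper's method goes beyond this sketch by treating both `alpha` and `q_matrix` as unknowns to be estimated jointly, with text-embedding-informed priors regularizing the Q-matrix.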