Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings. This geometry replaces a global ranking of LLMs with topical specialization and enables smooth variation across related questions. Building on this framework, our experimental results reveal that out-of-distribution behavior can be explained through directional alignment, and that larger norms consistently indicate harder questions. Moreover, JE-IRT naturally supports generalization: once the space is learned, new LLMs are added by fitting a single embedding. The learned space further reveals an LLM-internal taxonomy that only partially aligns with human-defined subject categories. We also show that simple linear probes of the embedding space recover cross-subject ability directions, such as an arithmetic axis that highlights quantitatively demanding questions in seemingly distant subjects like virology and global facts. JE-IRT thus establishes a unified and interpretable geometric lens that connects LLM abilities with the structure of questions, offering a distinctive perspective on model evaluation and generalization.
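To make the geometry described above concrete, the following is a minimal sketch of one plausible parameterization. The abstract does not state the exact response function, so everything here is an assumption for illustration: a model embedding `theta` is projected onto the unit direction of a question embedding `q`, and that projection is compared with the norm of `q` through a logistic link. The function names (`split_question`, `predict_correct`, `fit_new_model`), the learning rate, and the step count are hypothetical and do not come from JE-IRT itself.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def split_question(q):
    """Split a question embedding into (unit direction, norm).

    Assumption: the direction stands for semantics, the norm for difficulty.
    """
    norm = math.sqrt(sum(v * v for v in q))
    return [v / norm for v in q], norm


def predict_correct(theta, q):
    """Assumed response function: sigmoid(theta . direction(q) - ||q||)."""
    direction, difficulty = split_question(q)
    ability = sum(t * d for t, d in zip(theta, direction))
    return sigmoid(ability - difficulty)


def fit_new_model(questions, labels, lr=0.5, steps=2000):
    """Fit one embedding for a new model while question embeddings stay frozen.

    Plain gradient descent on the mean logistic loss; labels may be 0/1
    outcomes or soft targets in [0, 1].
    """
    dim = len(questions[0])
    parts = [split_question(q) for q in questions]
    theta = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for (direction, difficulty), y in zip(parts, labels):
            logit = sum(t * d for t, d in zip(theta, direction)) - difficulty
            err = sigmoid(logit) - y  # derivative of the loss w.r.t. the logit
            for k in range(dim):
                grad[k] += err * direction[k] / len(labels)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


# Usage: same direction with a larger norm gives a lower predicted accuracy,
# and a question pointing away from the model's strength is also harder.
model = [2.0, 0.0]
easy_aligned = predict_correct(model, [1.0, 0.0])
hard_aligned = predict_correct(model, [3.0, 0.0])
easy_orthogonal = predict_correct(model, [0.0, 1.0])
```

Under this assumed form, there is no single global ranking: two models with different `theta` can each be stronger on questions aligned with their own direction. Adding a new LLM only requires calling `fit_new_model` on its observed responses, which mirrors the "fitting a single embedding" step in the abstract without claiming to reproduce the authors' training procedure.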