Socio-cognitive benchmarks for large language models (LLMs) often fail to predict real-world behavior, even when models achieve high benchmark scores. Prior work has attributed this evaluation-deployment gap to problems of measurement and validity. While these critiques are insightful, we argue that they overlook a more fundamental issue: many socio-cognitive evaluations proceed without an explicit theoretical specification of the target capability, leaving the assumptions that link task performance to competence implicit. Without this theoretical grounding, benchmarks that exercise only narrow subsets of a capability are routinely misinterpreted as evidence of broad competence. The result is a systemic validity illusion, in which high scores mask the fact that the capability's other essential dimensions were never evaluated. To address this problem, we make two contributions. First, we diagnose and formalize this theory gap as a foundational failure that undermines measurement and enables systematic overgeneralization of benchmark results. Second, we introduce the Theory Trace Card (TTC), a lightweight documentation artifact designed to accompany socio-cognitive evaluations. A TTC explicitly outlines the theoretical basis of an evaluation, the components of the target capability it exercises, its operationalization, and its limitations. We argue that TTCs enhance the interpretability and reuse of socio-cognitive evaluations by making explicit the full validity chain, which links theory, task operationalization, scoring, and limitations, without modifying benchmarks or requiring agreement on a single theory.
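As a rough illustration only, the elements a TTC is said to outline could be held in a small record like the one below. This is a minimal sketch, not the TTC format itself: the class name, every field name, the `posited_components` field, the `unexercised_components` helper, and the placeholder values are all assumptions introduced here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TheoryTraceCard:
    """Hypothetical container for the elements a TTC is described as outlining."""

    theoretical_basis: str            # the theory the evaluation commits to
    posited_components: list[str]     # assumption: components that theory posits
    exercised_components: list[str]   # components the benchmark actually exercises
    operationalization: str           # how the task is meant to elicit the capability
    scoring: str                      # how responses are turned into scores
    limitations: list[str] = field(default_factory=list)

    def unexercised_components(self) -> list[str]:
        """Return posited components the benchmark does not exercise, in order."""
        exercised = set(self.exercised_components)
        return [c for c in self.posited_components if c not in exercised]


# Placeholder values only; they do not describe any real benchmark or theory.
card = TheoryTraceCard(
    theoretical_basis="<name of the chosen theory>",
    posited_components=["component_a", "component_b", "component_c"],
    exercised_components=["component_a"],
    operationalization="<task format and prompts>",
    scoring="<scoring rule>",
    limitations=["<known limitation>"],
)
print(card.unexercised_components())
```

Listing the posited and exercised components side by side is one way to make the narrow-subset problem visible: any component returned by the helper is a dimension of the capability about which the benchmark score says nothing.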