Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity, d') and how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce an evaluation framework based on Type-2 Signal Detection Theory that separates these capacities using meta-d' and the metacognitive efficiency ratio, M-ratio (meta-d'/d'). Applied to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials, we find: (1) metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar: Mistral achieves the highest d' but the lowest M-ratio; (2) metacognitive efficiency is domain-specific: different models show different weakest domains, a pattern invisible to aggregate metrics; (3) temperature manipulation shifts the Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity; (4) AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating that these metrics answer fundamentally different evaluation questions. The meta-d' framework reveals which models "know what they don't know" versus which merely appear well-calibrated due to criterion placement, a distinction with direct implications for model selection, deployment, and human-AI collaboration. The analysis was pre-registered; code and data are publicly available.
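To make the quantities named in the abstract concrete, the following is a minimal Python sketch, not the authors' released code. It assumes the textbook equal-variance definition d' = z(H) - z(FA), a rank-based estimate of AUROC_2 from per-trial confidence and correctness, and M-ratio as meta-d' divided by d'. The function names (`d_prime`, `auroc2`, `m_ratio`) and all input values are hypothetical illustrations. Estimating meta-d' itself requires fitting a Type-2 SDT model and is deliberately left out, so `m_ratio` takes an already-fitted meta-d' as input.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Type-1 sensitivity under equal-variance SDT: d' = z(H) - z(FA)."""
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(false_alarm_rate)


def auroc2(confidence: list[float], correct: list[bool]) -> float:
    """Type-2 AUROC: probability that a randomly chosen correct trial
    carries higher confidence than a randomly chosen incorrect trial
    (ties count as one half)."""
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect trial")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def m_ratio(meta_d: float, d: float) -> float:
    """Metacognitive efficiency: meta-d' expressed as a fraction of d'.
    meta_d is assumed to come from a separate Type-2 SDT fit."""
    if d == 0:
        raise ValueError("M-ratio is undefined when d' is zero")
    return meta_d / d


if __name__ == "__main__":
    # Illustrative numbers only; none of these come from the paper.
    print(d_prime(0.8, 0.2))
    print(auroc2([0.9, 0.8, 0.3, 0.6], [True, True, False, False]))
    print(m_ratio(0.8, 1.6))
```

In this sketch `auroc2` is computed from raw confidence ranks and so is not normalized by Type-1 sensitivity, whereas `m_ratio` divides by d' explicitly. That difference is one plausible reason the two metrics can order models differently.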