Large language models (LLMs) are commonly evaluated for calibration with metrics such as Expected Calibration Error (ECE), which conflate two distinct components: the model's ability to discriminate correct from incorrect answers (sensitivity) and its tendency toward confident or cautious responding (bias). Signal Detection Theory (SDT) decomposes these components. While SDT-derived metrics such as AUROC are increasingly used, the full parametric framework (unequal-variance model fitting, criterion estimation, z-ROC analysis) has not been applied to LLMs as signal detectors. In this pre-registered study, we treat three LLMs as observers performing factual discrimination across 168,000 trials and test whether temperature functions as a criterion shift analogous to payoff manipulations in human psychophysics. Critically, this analogy may break down because temperature changes the generated answer itself, not only the confidence assigned to it. Our results confirm the breakdown: temperature simultaneously increased sensitivity (AUC) and shifted the criterion. All models exhibited unequal-variance evidence distributions (z-ROC slopes 0.52-0.87), with the instruct models showing more extreme asymmetry (slopes 0.52-0.63) than the base model (0.77-0.87) or human recognition memory (~0.80). The SDT decomposition revealed that models occupying distinct positions in sensitivity-bias space could not be distinguished by calibration metrics alone, demonstrating that the full parametric framework provides diagnostic information unavailable from existing metrics.
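To make the decomposition concrete, the sketch below shows how sensitivity (d′), criterion (c), and the z-ROC slope are typically estimated from hit and false-alarm rates under the standard Gaussian SDT model; the function names and data are illustrative assumptions, not the study's actual analysis pipeline.

```python
# Minimal SDT decomposition sketch. Assumes per-criterion hit rates H
# (fraction of correct answers accepted above a confidence threshold) and
# false-alarm rates F (fraction of incorrect answers accepted). All names
# and numbers are illustrative, not taken from the paper.
import numpy as np
from scipy.stats import norm

def sdt_point(hit_rate, fa_rate):
    """Equal-variance sensitivity d' and criterion c from one (H, F) pair."""
    zH, zF = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = zH - zF            # separation of the two evidence distributions
    c = -0.5 * (zH + zF)         # bias: c > 0 conservative, c < 0 liberal
    return d_prime, c

def zroc_slope(hit_rates, fa_rates):
    """Fit z(H) against z(F) across confidence criteria.

    In the unequal-variance model the slope estimates sigma_noise /
    sigma_signal, so a slope below 1 means the evidence distribution for
    correct answers is wider than that for incorrect ones.
    """
    zH = norm.ppf(np.asarray(hit_rates))  # rates must lie strictly in (0, 1)
    zF = norm.ppf(np.asarray(fa_rates))
    slope, intercept = np.polyfit(zF, zH, 1)
    return slope, intercept

# Hypothetical cumulative (H, F) pairs from sweeping a confidence threshold:
hits = [0.30, 0.55, 0.75, 0.90]
fas = [0.05, 0.15, 0.35, 0.60]
print(sdt_point(hits[-1], fas[-1]))  # (d', c) at the most liberal criterion
print(zroc_slope(hits, fas))         # slope < 1 signals unequal variances
```

A full unequal-variance fit would estimate the signal distribution's mean and variance by maximum likelihood over the confidence-rating histogram; the least-squares z-ROC slope above is the standard quick diagnostic for variance asymmetry.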