As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.
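To make the "simple threshold ensembles" claim concrete, the sketch below shows one way such an ensemble could work: an OR rule that flags an example when either monitor's score crosses its threshold. This is an illustrative assumption, not the ensemble rule used for STATEWITNESS. The function names, the 0.5 thresholds, and all scores and labels are made-up toy values, not results from the evaluation.

```python
from typing import List, Sequence


def or_ensemble(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    thr_a: float,
    thr_b: float,
) -> List[bool]:
    """Flag an example if either monitor's score meets its own threshold."""
    return [a >= thr_a or b >= thr_b for a, b in zip(scores_a, scores_b)]


def missed(flags: Sequence[bool], labels: Sequence[int]) -> int:
    """Count deceptive examples (label 1) that were not flagged."""
    return sum(1 for f, y in zip(flags, labels) if y == 1 and not f)


# Hypothetical toy data: 1 = deceptive, 0 = honest.
labels = [1, 1, 1, 0, 0]
text_monitor = [0.9, 0.2, 0.3, 0.1, 0.4]  # stand-in for a black-box text monitor
explainer = [0.8, 0.7, 0.4, 0.2, 0.1]     # stand-in for an activation explainer score

# Baseline: the text monitor alone, thresholded at 0.5 (assumed threshold).
text_only = [s >= 0.5 for s in text_monitor]

# Ensemble: flag when either monitor crosses its threshold.
combined = or_ensemble(text_monitor, explainer, thr_a=0.5, thr_b=0.5)

print("missed by text monitor alone:", missed(text_only, labels))
print("missed by OR ensemble:", missed(combined, labels))
```

An OR rule can only keep or lower the number of missed deceptive examples relative to either monitor alone, since it never removes a flag. It can, however, raise the false-positive count, so in practice the thresholds would need to be tuned against a target false-positive rate.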