We uncover a latent capacity for introspection in a Qwen 32B model, demonstrating that the model can detect when concepts have been injected into its earlier context and identify which concept was injected. Although the model denies injection in its sampled outputs, logit lens analysis reveals clear detection signals in the residual stream, which are attenuated in the final layers. Furthermore, prompting the model with accurate information about AI introspection mechanisms dramatically strengthens this effect: sensitivity to injection rises from 0.3% to 39.2%, with only a 0.6% increase in false positives. In addition, the mutual information between the nine injected concepts and the recovered concepts rises from 0.62 bits to 1.05 bits, ruling out generic noise explanations. Our results demonstrate that models can have a surprising capacity for introspection and steering awareness that is easy to overlook, with consequences for latent reasoning and safety.
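For readers unfamiliar with the mutual information figures quoted above, the following is a minimal sketch of how such a quantity could be estimated from paired (injected, recovered) concept labels. The abstract does not state which estimator was used, so the plug-in (empirical frequency) estimator here is an assumption, and the function name `mutual_information_bits` and the concept labels are hypothetical. With nine concepts injected uniformly, the upper bound is log2(9), roughly 3.17 bits, which gives a scale for the reported 0.62 and 1.05 bits.

```python
import math
from collections import Counter


def mutual_information_bits(pairs):
    """Plug-in estimate of I(X; Y) in bits from (injected, recovered) pairs.

    Assumption: this is an illustrative estimator, not necessarily the one
    used to obtain the figures in the abstract.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one (injected, recovered) pair")

    joint = Counter(pairs)                     # counts of (x, y)
    injected = Counter(x for x, _ in pairs)    # marginal counts of x
    recovered = Counter(y for _, y in pairs)   # marginal counts of y

    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (injected[x] * recovered[y]))
    return mi


# Hypothetical labels for nine concepts.
concepts = [f"concept_{i}" for i in range(9)]

# Perfect recovery reaches the log2(9) upper bound.
perfect = [(c, c) for c in concepts]
print(mutual_information_bits(perfect), math.log2(9))

# Recovery that is independent of the injected concept carries no information.
independent = [(a, b) for a in concepts for b in concepts]
print(mutual_information_bits(independent))
```

Note that the plug-in estimator is biased upward when the number of trials is small relative to the number of (injected, recovered) cells, so a comparison between two conditions is most meaningful when both use similar sample sizes.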