Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study the mechanism of this introspection, first extensively replicating Lindsey et al. (2025)'s thought-injection detection paradigm in large open-source models. We show that these models detect injected representations via two separable mechanisms: (i) probability matching (inferring an injection from the perceived anomalousness of the prompt) and (ii) direct access to internal states. The direct-access mechanism is content-agnostic: models detect that an anomaly occurred but cannot reliably identify its semantic content. The two model classes we study confabulate injected concepts that are high-frequency and concrete (e.g., "apple"); for these models, correct concept guesses typically require significantly more tokens. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.
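The thought-injection paradigm referenced above can be illustrated with a minimal sketch: add a scaled "concept vector" to a layer's activations via a forward hook, so the model's residual stream carries an anomalous direction it never computed itself. Everything below (the toy layer, `concept_vec`, `injection_strength`) is a hypothetical stand-in, not the paper's actual setup, which operates on large open-source LLMs.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block's output (hypothetical; the real
# paradigm injects into the residual stream of a large language model).
torch.manual_seed(0)
d_model = 16
layer = nn.Linear(d_model, d_model)

# A "concept vector" -- in practice, e.g., a mean activation difference
# between concept-related and baseline prompts; here a fixed unit direction.
concept_vec = torch.randn(d_model)
concept_vec = concept_vec / concept_vec.norm()
injection_strength = 4.0

def inject_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output:
    # every token position is steered along the concept direction.
    return output + injection_strength * concept_vec

handle = layer.register_forward_hook(inject_hook)
x = torch.randn(3, d_model)  # 3 token positions
with torch.no_grad():
    injected = layer(x)
handle.remove()
with torch.no_grad():
    clean = layer(x)

# The injected run differs from the clean run by exactly the steering term.
delta = injected - clean
print(torch.allclose(delta, injection_strength * concept_vec.expand_as(delta),
                     atol=1e-6))  # → True
```

In the detection paradigm, the question is then whether the model's own generated text reports that such an injection occurred, and whether it can name the injected concept.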