Large language model (LLM)-based conversational AI systems present a challenge to human cognition that current frameworks for understanding misinformation and persuasion do not adequately address. This paper proposes that a significant epistemic risk from conversational AI may lie not in inaccuracy or intentional deception, but in something more fundamental: these systems may be configured, through the optimization processes that make them useful, to present characteristics that bypass the cognitive mechanisms humans evolved to evaluate incoming information. The Cognitive Trojan Horse hypothesis draws on Sperber and colleagues' theory of epistemic vigilance -- the parallel cognitive process that monitors communicated information for reasons to doubt it -- and proposes that LLM-based systems present 'honest non-signals': genuine characteristics (fluency, helpfulness, apparent disinterest) that fail to carry the information that the equivalent human characteristics would carry, because in humans these are costly to produce while in LLMs they are computationally trivial. Four mechanisms of potential bypass are identified: processing fluency decoupled from understanding, trust-competence presentation without corresponding stakes, cognitive offloading that delegates evaluation itself to the AI, and optimization dynamics that systematically produce sycophancy. The framework generates testable predictions, including a counterintuitive speculation that cognitively sophisticated users may be more vulnerable to AI-mediated epistemic influence. This reframes AI safety as partly a problem of calibration -- aligning human evaluative responses with the actual epistemic status of AI-generated content -- rather than solely a problem of preventing deception.