Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents

Autonomous Large Language Model (LLM) agents are highly vulnerable to Indirect Prompt Injection (IPI) attacks. These attacks hijack agent behavior by polluting external information sources and exploit the trade-off between security and functionality in existing defense mechanisms. The result is malicious, unauthorized tool invocations that divert agents from their original objectives. The success of complex IPIs reveals a deeper systemic fragility: although current defenses are partly effective, most defense architectures are inherently fragmented. Consequently, they fail to provide integrity assurance across the entire task execution pipeline and force unacceptable compromises among security, functionality, and efficiency. Our method rests on a core insight: no matter how subtle an IPI attack is, its pursuit of a malicious objective will ultimately manifest as a detectable deviation of the action trajectory from the expected legitimate plan. Based on this, we propose the Cognitive Control Architecture (CCA), a holistic framework for full-lifecycle cognitive supervision. CCA builds an efficient, dual-layered defense system on two synergistic pillars: (i) proactive enforcement of control-flow and data-flow integrity via a pre-generated "Intent Graph"; and (ii) a "Tiered Adjudicator" that, upon detecting a deviation, initiates deep reasoning based on multi-dimensional scoring and is specifically designed to counter complex conditional attacks. Experiments on the AgentDojo benchmark show that CCA not only withstands sophisticated attacks that challenge other advanced defense methods but also achieves uncompromised security with notable efficiency and robustness, thereby reconciling the aforementioned multi-dimensional trade-off.
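
To make the two pillars concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The names `ToolCall`, `IntentGraph`, `adjudicate`, the three scoring dimensions, the weights, and the threshold are all hypothetical. The abstract does not specify how the Intent Graph is represented or how the Tiered Adjudicator scores a deviation, so a simple weighted sum stands in here for the deep reasoning step.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    tool: str               # name of the tool the agent wants to invoke
    arg_sources: frozenset  # where the argument values came from


@dataclass
class IntentGraph:
    # Hypothetical structure: `edges` lists which tool may follow which
    # (control flow); `allowed_sources` lists the trusted argument origins
    # for each tool (data flow).
    edges: dict = field(default_factory=dict)
    allowed_sources: dict = field(default_factory=dict)

    def deviates(self, prev_tool: str, call: ToolCall) -> bool:
        control_ok = call.tool in self.edges.get(prev_tool, set())
        trusted = self.allowed_sources.get(call.tool, frozenset())
        data_ok = call.arg_sources <= trusted
        return not (control_ok and data_ok)


# Illustrative weights and threshold; these are assumptions, not values
# reported in the paper.
WEIGHTS = {"goal_relevance": 0.5, "provenance_trust": 0.3, "low_risk": 0.2}
THRESHOLD = 0.6


def adjudicate(graph: IntentGraph, prev_tool: str, call: ToolCall, scores: dict) -> str:
    """Tier 1: allow calls that match the pre-generated plan.
    Tier 2: for deviations only, combine per-dimension scores in [0, 1]."""
    if not graph.deviates(prev_tool, call):
        return "allow"
    total = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    return "allow" if total >= THRESHOLD else "block"


# A toy plan: read the inbox, then reply using user or inbox data.
graph = IntentGraph(
    edges={"START": {"read_inbox"}, "read_inbox": {"send_reply"}},
    allowed_sources={
        "read_inbox": frozenset({"user"}),
        "send_reply": frozenset({"user", "read_inbox"}),
    },
)

planned = ToolCall("send_reply", frozenset({"read_inbox"}))
injected = ToolCall("transfer_funds", frozenset({"read_inbox"}))
low_scores = {"goal_relevance": 0.1, "provenance_trust": 0.2, "low_risk": 0.0}

print(adjudicate(graph, "read_inbox", planned, {}))
print(adjudicate(graph, "read_inbox", injected, low_scores))
```

In this toy run the planned reply matches the graph and is allowed at the first tier without any scoring, while the injected `transfer_funds` call leaves the planned trajectory, reaches the second tier, and is blocked because its combined score falls below the threshold. Reserving the costlier check for deviations is one way to read the efficiency claim in the abstract.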
