As the cornerstone of artificial intelligence, machine perception confronts a fundamental threat posed by adversarial illusions. These adversarial attacks manifest in two primary forms: deductive illusion, where specific stimuli are crafted based on the victim model's general decision logic, and inductive illusion, where the victim model's general decision logic is shaped by specific stimuli. The former exploits the model's decision boundaries to craft a stimulus that, when applied, interferes with its decision-making process. The latter reinforces a conditioned reflex in the model, embedding a backdoor during its learning phase that, when triggered by a stimulus, causes aberrant behaviours. The multifaceted nature of adversarial illusions calls for a unified defence framework that addresses vulnerabilities across these forms of attack. In this study, we propose a disillusion paradigm based on the concept of an imitation game. At the heart of the imitation game lies a multimodal generative agent, steered by chain-of-thought reasoning, which observes, internalises and reconstructs the semantic essence of a sample, liberated from the classic pursuit of reversing the sample to its original state. As a proof of concept, we conduct experimental simulations using a multimodal generative dialogue agent and evaluate the methodology under a variety of attack scenarios.
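The observe–internalise–reconstruct loop described above can be sketched as a purification pipeline. The following is a minimal toy illustration, not the paper's implementation: the function names (`observe`, `internalise`, `reconstruct`, `disillusion`) and the sign-pattern "essence" are hypothetical stand-ins for the multimodal generative agent's chain-of-thought description and regeneration steps.

```python
def observe(sample):
    # Step 1: the agent receives the (possibly adversarial) sample as-is.
    return sample

def internalise(sample):
    # Step 2: distil the semantic essence, discarding the fine-grained
    # detail where adversarial perturbations live. Here the "essence" of a
    # toy feature vector is just its sign pattern (hypothetical stand-in
    # for a chain-of-thought semantic description).
    return ["+" if x >= 0 else "-" for x in sample]

def reconstruct(essence):
    # Step 3: regenerate a fresh sample from the essence alone, rather than
    # trying to reverse the input to its original state; the perturbation
    # never reaches the downstream model.
    return [1.0 if s == "+" else -1.0 for s in essence]

def disillusion(sample):
    return reconstruct(internalise(observe(sample)))

clean = [0.9, -0.8, 0.7]
perturbed = [0.95, -0.75, 0.65]  # small additive perturbation
# Both inputs share the same semantics, so both reconstruct identically.
assert disillusion(clean) == disillusion(perturbed)
```

The point of the sketch is the design choice stated in the abstract: the defence regenerates a sample from its internalised semantics instead of attempting to invert the attack, so the same pipeline applies whether the illusion is deductive (an evasion stimulus) or inductive (a backdoor trigger).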