Latent-based multi-agent systems replace parts of explicit inter-agent communication with hidden representations, offering a new direction for efficient and flexible agent collaboration. However, moving coordination into latent space may also move attacks beyond the reach of visible-text inspection. In this paper, we study whether latent states can carry attack-associated information that remains effective during clean executions. To examine this question, we introduce a latent attack framework that reactivates attack-induced effects through latent interventions without reusing adversarial text. Extensive experiments show that the resulting latent-only attacks can substantially degrade task performance in clean executions, especially when applied to inter-agent KV-cache handoffs rather than local hidden states. Further control analyses indicate that this degradation cannot be reduced to arbitrary perturbations or invalid generation. Overall, our findings suggest that latent-based collaboration does not remove attack risk. It shifts part of the risk into less observable execution states, calling for safeguards beyond visible-text inspection.