ARB4WM: An Adversarial Robustness Benchmark for World Models in Continuous Control

World models are widely used in robotic and agentic engineering control systems because they learn latent dynamics that support planning and decision-making. As these systems are increasingly deployed in safety-critical settings, understanding their robustness under adversarial conditions has become essential. However, existing evaluations lack a unified benchmark for testing adversarial threats across the policy, value, and latent-dynamics levels of world-model agents. To fill this gap, we present ARB4WM, a unified evaluation framework for pre-deployment robustness and risk assessment of world-model agents under visual perturbations. ARB4WM defines five white-box loss objectives across these three levels and studies their effects when combined with single-step or multi-step perturbation strategies and with temporal attack modes, including full-frame, half-sequence, and sparse-frame exposure. Specifically, we evaluate four Dreamer-style agents across 20 tasks from MetaWorld and the DeepMind Control Suite under different loss objectives, perturbation strategies, and temporal attack modes. Results show that attacks targeting value estimation, latent representations, and RSSM dynamics can be as damaging as direct policy disruption, and that early or frequent perturbations are especially harmful. Input-level defenses, by contrast, provide limited recovery under adaptive attacks. These findings suggest that safety, risk, and reliability assessment for world models should cover multiple component-oriented attack objectives and temporal exposure protocols rather than relying solely on action-space robustness. Source code is available at https://github.com/zaoanguai/ARB4WM.
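
To make the perturbation strategies and temporal attack modes concrete, the following is a minimal sketch and not the ARB4WM implementation. It assumes an L-infinity budget `eps`, a sign-gradient update in the style of FGSM (single-step) and PGD (multi-step), pixel values in [0, 1], a "half-sequence" mode that perturbs the first half of an episode, and a "sparse-frame" mode that perturbs every `stride`-th frame. The names `exposure_mask`, `linf_attack`, `grad_fn`, and `stride` are hypothetical and chosen only for illustration; the abstract does not specify any of these details.

```python
from typing import Callable, List, Sequence


def exposure_mask(num_steps: int, mode: str, stride: int = 4) -> List[bool]:
    """Return one flag per timestep saying whether that frame is perturbed.

    Assumed semantics (not taken from the paper):
      "full"   -> every frame is perturbed
      "half"   -> only the first half of the episode is perturbed
      "sparse" -> every `stride`-th frame is perturbed
    """
    if mode == "full":
        return [True] * num_steps
    if mode == "half":
        return [t < num_steps // 2 for t in range(num_steps)]
    if mode == "sparse":
        return [t % stride == 0 for t in range(num_steps)]
    raise ValueError(f"unknown exposure mode: {mode!r}")


def _sign(x: float) -> int:
    """Sign of x as -1, 0, or 1."""
    return (x > 0) - (x < 0)


def linf_attack(
    obs: Sequence[float],
    grad_fn: Callable[[Sequence[float]], Sequence[float]],
    eps: float,
    steps: int = 1,
) -> List[float]:
    """Sign-gradient ascent on an attack loss inside an L-infinity ball.

    steps == 1 gives a single-step perturbation; steps > 1 gives an
    iterative one with projection after every step. `grad_fn` returns the
    gradient of the chosen attack loss with respect to the observation;
    in a real agent it would come from autodiff through the policy,
    value, or latent-dynamics component being targeted.
    """
    alpha = eps / steps  # per-step size, chosen so the total budget is eps
    adv = list(obs)
    for _ in range(steps):
        grad = grad_fn(adv)
        adv = [a + alpha * _sign(g) for a, g in zip(adv, grad)]
        # Project back into the eps-ball around the clean observation.
        adv = [min(max(a, o - eps), o + eps) for a, o in zip(adv, obs)]
        # Keep pixel values in the assumed valid range [0, 1].
        adv = [min(max(a, 0.0), 1.0) for a in adv]
    return adv


def toy_grad(x: Sequence[float]) -> List[float]:
    """Gradient of the toy loss sum((x_i - target_i)^2), target = [0, 1]."""
    target = [0.0, 1.0]
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]
```

In an evaluation loop, a frame at timestep `t` would be passed through `linf_attack` only when `exposure_mask(...)[t]` is true, with `grad_fn` swapped to target whichever loss objective is being studied.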
