Vision-Language-Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. However, existing VLA systems still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We argue that these limitations are closely tied to the structural properties of actions in VLA settings, including the inherently multi-peaked nature of action distributions, the token-based symbolic reasoning of pretrained VLM/VLA backbones, and the effective finite resolution imposed by real-world robotic control. Motivated by these properties, we introduce E0, a Tweedie discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. By operating in a discrete action space with a principled diffusion process, E0 naturally aligns with token-based reasoning, supports fine-grained yet executable action control, and avoids the distributional mismatch of masking-based discrete diffusion. We further introduce a spherical viewpoint perturbation augmentation to enhance robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, ManiSkill, and a real-world Franka arm demonstrate that E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average.
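To make the notions of "quantized action tokens" and "finite resolution" concrete, the following is a minimal sketch of uniform action binning. It is not E0's actual tokenizer: the function names `quantize` and `dequantize`, the choice of 256 bins, and the normalized action range [-1, 1] are all assumptions made for illustration.

```python
from typing import List


def quantize(action: List[float], num_bins: int = 256,
             low: float = -1.0, high: float = 1.0) -> List[int]:
    """Map each continuous action dimension to a bin index in [0, num_bins - 1].

    Assumption: actions are normalized to [low, high] and binned uniformly.
    """
    tokens = []
    for a in action:
        a = min(max(a, low), high)                 # clip out-of-range values
        frac = (a - low) / (high - low)            # position in [0, 1]
        idx = min(int(frac * num_bins), num_bins - 1)  # top edge goes to last bin
        tokens.append(idx)
    return tokens


def dequantize(tokens: List[int], num_bins: int = 256,
               low: float = -1.0, high: float = 1.0) -> List[float]:
    """Map each bin index back to the center of its bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


if __name__ == "__main__":
    action = [-0.73, 0.2, 0.999]
    tokens = quantize(action)
    recovered = dequantize(tokens)
    print(tokens)
    print(recovered)
```

Under these assumptions, the round-trip error per dimension is at most half a bin width, i.e. (high - low) / (2 * num_bins). This is the sense in which a discrete action space has a finite but controllable resolution: increasing `num_bins` tightens the bound.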