Diffusion-based visuomotor policies built on 3D visual representations have achieved strong performance in learning complex robotic skills. However, most existing methods employ an oversized denoising decoder. While increasing model capacity can improve denoising, empirical evidence suggests that it also introduces redundancy and noise into intermediate feature blocks. Crucially, we find that randomly masking backbone features at inference time (without changing training) can improve performance, confirming the presence of task-irrelevant noise in intermediate features. Motivated by this observation, we propose Variational Regularization (VR), a lightweight module that imposes a timestep-conditioned Gaussian over backbone features and applies a KL-divergence regularizer, forming an adaptive information bottleneck. Extensive experiments on three simulation benchmarks (RoboTwin2.0, Adroit, and MetaWorld) show that, compared to the DP3 baseline, our approach improves the success rate by 6.1% on RoboTwin2.0 and by 4.1% on Adroit and MetaWorld, achieving new state-of-the-art results. Real-world experiments further demonstrate that our method performs well in practical deployments. Code will be released.
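The Variational Regularization module described above can be sketched as follows. This is a minimal, hypothetical PyTorch illustration, not the paper's actual implementation: the class name, layer sizes, and timestep-embedding choice are assumptions. The core idea shown is predicting a Gaussian (mean and log-variance) over backbone features conditioned on the diffusion timestep, replacing the features with a reparameterized sample, and penalizing the KL divergence to a standard normal prior as the bottleneck term.

```python
import torch
import torch.nn as nn


class VariationalRegularization(nn.Module):
    """Timestep-conditioned Gaussian bottleneck over backbone features
    (illustrative sketch; names and dimensions are hypothetical)."""

    def __init__(self, feat_dim: int, t_embed_dim: int = 32):
        super().__init__()
        # Simple learned embedding of the scalar diffusion timestep.
        self.t_mlp = nn.Sequential(nn.Linear(1, t_embed_dim), nn.SiLU())
        # Heads predicting the Gaussian's mean and log-variance from
        # backbone features concatenated with the timestep embedding.
        self.mu_head = nn.Linear(feat_dim + t_embed_dim, feat_dim)
        self.logvar_head = nn.Linear(feat_dim + t_embed_dim, feat_dim)

    def forward(self, feats: torch.Tensor, t: torch.Tensor):
        # feats: (B, D) backbone features; t: (B,) diffusion timesteps.
        t_emb = self.t_mlp(t.float().unsqueeze(-1))
        h = torch.cat([feats, t_emb], dim=-1)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        # Reparameterized sample replaces the raw features downstream.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL(q(z | x, t) || N(0, I)): the information-bottleneck penalty
        # added to the diffusion policy's training loss.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
        return z, kl


# Usage: regularize backbone features before the denoising decoder.
vr = VariationalRegularization(feat_dim=64)
feats = torch.randn(8, 64)
t = torch.randint(0, 100, (8,))
z, kl = vr(feats, t)
```

In this sketch the KL term would be weighted and added to the usual denoising loss during training; at inference only the sampled (or mean) features are passed to the decoder.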