Proprioceptive information is critical for precise servo control, as it provides real-time robotic states. Combining it with vision is widely expected to enhance the performance of manipulation policies in complex tasks. However, recent studies have reported inconsistent observations on the generalization of vision-proprioception policies. In this work, we investigate this issue by conducting temporally controlled experiments. We find that during task sub-phases in which the robot's motion transitions, which require target localization, the vision modality of the vision-proprioception policy plays only a limited role. Further analysis reveals that the policy naturally gravitates toward concise proprioceptive signals that offer faster loss reduction during training; these signals thereby dominate the optimization and suppress the learning of the visual modality in motion-transition phases. To alleviate this, we propose the Gradient Adjustment with Phase-guidance (GAP) algorithm, which adaptively modulates the optimization of proprioception and enables dynamic collaboration within the vision-proprioception policy. Specifically, we leverage proprioception to capture robotic states and estimate the probability that each timestep in the trajectory belongs to a motion-transition phase. During policy learning, we apply a fine-grained adjustment that reduces the magnitude of proprioception's gradient according to the estimated probabilities, yielding robust and generalizable vision-proprioception policies. Comprehensive experiments demonstrate that GAP is applicable in both simulated and real-world environments, across single-arm and dual-arm setups, and is compatible with both conventional policies and Vision-Language-Action models. We believe this work offers valuable insights into the development of vision-proprioception policies for robotic manipulation.
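The core mechanism of phase-guided gradient adjustment can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the helper names (`ScaleGrad`, `adjust_proprio_features`) and the linear scaling rule `1 - p_t`, where `p_t` is the estimated probability that timestep `t` lies in a motion-transition phase, are assumptions for exposition.

```python
import torch

class ScaleGrad(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient in the backward pass.

    This leaves the policy's predictions unchanged while attenuating how
    strongly the proprioceptive branch is updated during training.
    """
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x

    @staticmethod
    def backward(ctx, grad_out):
        # Gradient flowing back into the proprioceptive features is shrunk;
        # no gradient is needed for the (non-tensor) scale argument.
        return grad_out * ctx.scale, None

def adjust_proprio_features(proprio_feat: torch.Tensor,
                            transition_prob: float) -> torch.Tensor:
    """Attenuate the proprioception gradient in motion-transition phases.

    When transition_prob is high (the robot's motion is transitioning and
    target localization is needed), the proprioceptive gradient is reduced,
    leaving more of the optimization signal to the visual modality.
    The linear schedule 1 - p_t is an illustrative choice.
    """
    scale = 1.0 - transition_prob
    return ScaleGrad.apply(proprio_feat, scale)
```

In a training loop, this adjustment would be applied to the proprioceptive features before they are fused with visual features, with `transition_prob` supplied per timestep by the phase estimator.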