Robots operating in open, unstructured real-world environments must rely on onboard visual perception while moving autonomously between locations. As the robot moves, the onboard camera viewpoint changes continuously, so target objects vary significantly in apparent scale, which complicates vision-based motion generation. In this work, we present a stereo multistage spatial attention-based deep predictive learning method for real-time mobile manipulation. The proposed method extracts task-relevant spatial attention points from stereo images and integrates them with robot states through a hierarchical recurrent architecture for closed-loop action prediction. We evaluate the system on a mobile manipulator across four real-world mobile manipulation tasks, which include rigid placement, articulated object manipulation, and deformable object interaction. Experiments under randomized initial positions and visual disturbance conditions demonstrate improved robustness and higher task success rates than representative imitation learning and vision-language-action baselines under identical control settings. The results indicate that structured stereo spatial attention combined with predictive temporal modeling provides an effective solution within the evaluated mobile manipulation scenarios.
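To make the idea of a spatial attention point concrete, the sketch below computes a soft-argmax (spatial softmax expectation) over a 2D feature map and concatenates the resulting left and right points with a robot state vector. This is an illustrative assumption, not the method described above: the abstract does not specify how attention points are extracted, and the names `soft_argmax_2d`, `build_recurrent_input`, and `temperature` are hypothetical.

```python
import math


def soft_argmax_2d(feature_map, temperature=1.0):
    """Return the softmax-weighted mean (x, y) position of a 2D map.

    Coordinates are normalized to [-1, 1]; x runs along columns, y along rows.
    Hypothetical helper: the abstract does not state the extraction mechanism.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in feature_map)
    weights = [
        [math.exp((value - peak) / temperature) for value in row]
        for row in feature_map
    ]
    total = sum(sum(row) for row in weights)

    x = 0.0
    y = 0.0
    for i, row in enumerate(weights):
        y_coord = 2.0 * i / (height - 1) - 1.0 if height > 1 else 0.0
        for j, weight in enumerate(row):
            x_coord = 2.0 * j / (width - 1) - 1.0 if width > 1 else 0.0
            prob = weight / total
            x += prob * x_coord
            y += prob * y_coord
    return x, y


def build_recurrent_input(left_map, right_map, robot_state):
    """Concatenate left/right attention points with the robot state.

    Assumed fusion by plain concatenation; the resulting vector is what a
    recurrent predictor would consume at each control step.
    """
    left_x, left_y = soft_argmax_2d(left_map)
    right_x, right_y = soft_argmax_2d(right_map)
    return [left_x, left_y, right_x, right_y] + list(robot_state)


if __name__ == "__main__":
    # Toy 5x5 feature maps with one strong activation each.
    left = [[0.0] * 5 for _ in range(5)]
    right = [[0.0] * 5 for _ in range(5)]
    left[2][3] = 10.0
    right[2][1] = 10.0
    print(build_recurrent_input(left, right, robot_state=[0.1, -0.2, 0.3]))
```

Because each point is an expectation over the whole map rather than a hard maximum, it stays differentiable and can be trained end to end with a downstream predictor. The horizontal offset between the left and right points also carries a disparity-like cue, which is one plausible reason to use stereo input when object scale varies.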