Robotic manipulation has been widely applied in industrial scenarios. Compared with single-arm setups, bimanual manipulation systems are equipped with multiple cameras that capture information from different viewpoints. However, existing multi-view policies either encode each view independently or fuse view features only shallowly, which limits the sharing of semantic perception across views and leaves spatial awareness unreliable. In this paper, we propose \textbf{MV-Actor}, a multi-view perception framework that builds a unified semantic-spatial representation for bimanual manipulation. First, MV-Actor performs Multi-view Semantic Interaction to share semantic perception across views. Second, it uses Semantic-Spatial Token Interaction to ground visual semantics in features from a feed-forward reconstruction model, thereby acquiring reliable spatial awareness. Finally, a Guided Metric Depth Repair module refines degraded sensor depth to provide more reliable metric anchors under consumer-grade depth noise. In simulation experiments on the PerAct2 bimanual benchmark, MV-Actor achieves a state-of-the-art average success rate of 87.8\%. In real-world evaluations with more frequent viewpoint changes and unstable consumer-grade depth, MV-Actor outperforms both RGB and RGB-D baselines, further demonstrating the benefit of shared semantic perception and reliable spatial awareness for bimanual manipulation.
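The contrast the abstract draws between encoding each view independently and sharing perception across views can be illustrated with a toy example. The sketch below is not MV-Actor's implementation: it assumes single-head self-attention with identity projections over random tokens, and the function names (`self_attention`, `encode_views_independently`, `encode_views_jointly`) are hypothetical. It only shows that when tokens from all views are attended over jointly, a change in one camera's tokens affects the features of another view, whereas per-view encoding leaves them untouched.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # Single-head self-attention with identity Q/K/V projections.
    # This is an illustrative simplification, not the paper's module.
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def encode_views_independently(views):
    # Each view attends only to its own tokens (no cross-view sharing).
    return [self_attention(v) for v in views]


def encode_views_jointly(views):
    # Concatenate tokens from all views so every token can attend to
    # every other view, then split the result back per view.
    lengths = [v.shape[0] for v in views]
    joint = self_attention(np.concatenate(views, axis=0))
    return np.split(joint, np.cumsum(lengths)[:-1], axis=0)


# Hypothetical setup: 3 camera views, 4 tokens per view, 8-dim features.
rng = np.random.default_rng(0)
views = [rng.normal(size=(4, 8)) for _ in range(3)]

independent = encode_views_independently(views)
joint = encode_views_jointly(views)

# Perturb only the second view and re-encode.
perturbed = [views[0], views[1] + 1.0, views[2]]
independent_after = encode_views_independently(perturbed)
joint_after = encode_views_jointly(perturbed)

print("view 0 unchanged (independent):",
      np.allclose(independent_after[0], independent[0]))
print("view 0 unchanged (joint):",
      np.allclose(joint_after[0], joint[0]))
```

A real system would use learned projections, multiple heads, and view or positional embeddings; the identity projections here are chosen only to keep the sketch short and self-contained.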