Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusions, viewpoint changes, and scene variations. Existing vision-language-action (VLA) models often fail to generalize because (i) multi-view features are fused via view-agnostic token concatenation, yielding weak 3D-consistent spatial understanding, and (ii) language is injected as global conditioning, resulting in coarse instruction grounding. In this paper, we introduce PEAfowl, a perception-enhanced multi-view VLA policy for bimanual manipulation. For spatial reasoning, PEAfowl predicts per-token depth distributions, performs differentiable 3D lifting, and aggregates local cross-view neighbors to form geometrically grounded, cross-view-consistent representations. For instruction grounding, we replace global conditioning with a Perceiver-style text-aware readout over frozen CLIP visual features, enabling iterative evidence accumulation. To overcome noisy and incomplete commodity depth without adding inference overhead, we apply training-only depth distillation from a pretrained depth teacher to supervise the depth-distribution head, providing the perception front-end with geometry-aware priors. On RoboTwin 2.0 under the domain-randomized setting, PEAfowl improves on the strongest baseline by 23.0 pp in success rate, and real-robot experiments further demonstrate reliable sim-to-real transfer and consistent gains from depth distillation. Project website: https://peafowlvla.github.io/.
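The per-token depth prediction and differentiable 3D lifting mentioned above can be illustrated with a minimal numpy sketch. This is an assumed, simplified version, not the paper's implementation: `lift_tokens`, its argument names, and the soft-argmax-over-bins formulation are illustrative choices for how a depth-distribution head could be lifted to camera-frame 3D points.

```python
import numpy as np

def lift_tokens(depth_logits, pixel_coords, K, depth_bins):
    """Lift 2D tokens to 3D via the expected depth of a per-token distribution.

    depth_logits: (N, B) unnormalized scores over B depth bins (hypothetical head output)
    pixel_coords: (N, 2) token-center pixel coordinates (u, v)
    K:            (3, 3) camera intrinsics
    depth_bins:   (B,) candidate depth values in meters
    """
    # Softmax over depth bins -> per-token depth distribution
    p = np.exp(depth_logits - depth_logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # Expected depth (soft-argmax over bins) -- differentiable in the logits
    z = p @ depth_bins                                   # (N,)
    # Unproject pixel coordinates along camera rays
    uv1 = np.concatenate([pixel_coords, np.ones((len(pixel_coords), 1))], axis=1)
    rays = uv1 @ np.linalg.inv(K).T                      # (N, 3)
    return rays * z[:, None]                             # (N, 3) camera-frame points
```

Because the expectation over bins is smooth in the logits, gradients from any downstream 3D loss flow back into the depth head; cross-view neighbor aggregation would then operate on these lifted points.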
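The Perceiver-style text-aware readout can likewise be sketched in a few lines. The following is a hedged toy version under stated assumptions: single-head dot-product cross-attention, keys shared with values, and text-derived latent queries; the function name `text_aware_readout` and the residual-update form are illustrative, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_aware_readout(text_queries, visual_tokens, n_iters=2):
    """Perceiver-style readout: text-derived latent queries repeatedly
    cross-attend to frozen visual tokens, accumulating evidence per step.

    text_queries:  (L, D) latent queries initialized from instruction tokens (assumed)
    visual_tokens: (N, D) frozen CLIP visual features (keys == values for brevity)
    """
    d = text_queries.shape[1]
    q = text_queries
    for _ in range(n_iters):
        attn = softmax(q @ visual_tokens.T / np.sqrt(d))  # (L, N) attention weights
        q = q + attn @ visual_tokens                      # residual evidence accumulation
    return q
```

Iterating the cross-attention lets each instruction-conditioned query refine which visual regions it reads from, in contrast to a single global conditioning vector.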