Achieving generalizable manipulation in unconstrained environments requires robots to proactively resolve information uncertainty, i.e., to possess the capability of active perception. However, existing methods are often confined to a limited set of sensing behaviors, restricting their applicability to complex environments. In this work, we formalize active perception as a non-Markovian process driven by information gain and decision branching, providing a structured categorization of visual active perception paradigms. Building on this perspective, we introduce CoMe-VLA, a cognitive and memory-aware vision-language-action (VLA) framework that leverages large-scale human egocentric data to learn versatile exploration and manipulation priors. Our framework integrates a cognitive auxiliary head for autonomous sub-task transitions and a dual-track memory system that maintains consistent self- and environmental awareness by fusing proprioceptive and visual temporal contexts. By aligning human and robot hand-eye coordination behaviors in a unified egocentric action space, we train the model progressively in three stages. Extensive experiments on a wheeled humanoid demonstrate the strong robustness and adaptability of our method across diverse long-horizon tasks spanning multiple active perception scenarios.