Humanoid robot manipulation is a crucial research area for executing diverse human-level tasks, involving high-level semantic reasoning and low-level action generation. However, precise scene understanding and sample-efficient learning from human demonstrations remain critical challenges, severely hindering the applicability and generalizability of existing frameworks. This paper presents RGMP-S, a novel Recurrent Geometric-prior Multimodal Policy with Spiking features that facilitates both high-level skill reasoning and data-efficient motion synthesis. To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases to enable precise 3D scene understanding within the vision-language model. Specifically, we construct a Long-horizon Geometric Prior Skill Selector that effectively aligns semantic instructions with spatial constraints, ultimately achieving robust generalization in unseen environments. To address the data-efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network. We parameterize robot-object interactions via recursive spiking for spatiotemporal consistency, fully distilling long-horizon dynamic features while mitigating overfitting in sparse demonstration scenarios. Extensive experiments are conducted on the ManiSkill simulation benchmark and three heterogeneous real-world robotic systems, encompassing a custom-developed humanoid, a desktop manipulator, and a commercial robotic platform. Empirical results substantiate the superiority of our method over state-of-the-art baselines and validate the efficacy of the proposed modules in diverse generalization scenarios. To facilitate reproducibility, the source code and video demonstrations are publicly available at https://github.com/xtli12/RGMP-S.git.
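The recursive-spiking parameterization mentioned above can be illustrated with a minimal sketch of a recurrent leaky integrate-and-fire (LIF) update, where each layer's membrane potential and previous spikes carry state across timesteps. This is a generic illustration of the mechanism, not the paper's implementation: the function name `recurrent_lif_step`, the weight shapes, and the constants `tau` and `v_th` are all assumptions for exposition.

```python
import numpy as np

def recurrent_lif_step(x, v, s_prev, w_in, w_rec, tau=2.0, v_th=1.0):
    """One step of a recurrent leaky integrate-and-fire (LIF) layer.

    x:      input features at this timestep, shape (d_in,)
    v:      membrane potential carried across timesteps, shape (d_hid,)
    s_prev: previous binary spike vector, shape (d_hid,)
    """
    # Leaky integration: decay the old potential, then add the
    # feedforward drive and the recurrent drive from previous spikes.
    v = v / tau + w_in @ x + w_rec @ s_prev
    s = (v >= v_th).astype(v.dtype)  # emit a spike where the threshold is crossed
    v = v * (1.0 - s)                # hard reset of units that just spiked
    return v, s

# Unroll the cell over a short sequence (random weights for illustration).
rng = np.random.default_rng(0)
d_in, d_hid, T = 8, 16, 5
w_in = rng.normal(scale=0.5, size=(d_hid, d_in))
w_rec = rng.normal(scale=0.2, size=(d_hid, d_hid))

v = np.zeros(d_hid)
s = np.zeros(d_hid)
for t in range(T):
    x = rng.normal(size=d_in)
    v, s = recurrent_lif_step(x, v, s, w_in, w_rec)
```

The binary spike vector fed back through `w_rec` is what makes the state update recursive in both potential and spiking activity, which is one plausible way to maintain the spatiotemporal consistency the abstract refers to.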