Recent advances in legged locomotion learning remain dominated by geometric representations of the environment, limiting the robot's ability to respond to higher-level semantics such as human instructions. To address this limitation, we propose a novel approach that integrates high-level commonsense reasoning from foundation models into legged locomotion adaptation. Specifically, our method uses a pre-trained large language model to synthesize an instruction-grounded skill database tailored to legged robots. A pre-trained vision-language model extracts high-level environmental semantics and grounds them in the skill database, enabling real-time skill advisories for the robot. To enable versatile skill control, we train a style-conditioned policy that generates diverse and robust locomotion skills with high fidelity to the specified styles. To the best of our knowledge, this is the first work to demonstrate real-time adaptation of legged locomotion driven by high-level reasoning over environmental semantics and instructions, achieving instruction-following accuracy of up to 87% without online queries to cloud-based foundation models.
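The grounding step described above — matching extracted environmental semantics against an offline skill database to produce a style command for the policy — can be sketched minimally as a nearest-neighbor lookup over embeddings. The database contents, skill names, and vector values below are all hypothetical placeholders (the paper's actual embeddings would come from the pre-trained language and vision-language models); this is only an illustration of the retrieval pattern, not the authors' implementation.

```python
import math

# Hypothetical skill database: each skill maps to a semantic embedding
# (used for grounding) and a style vector (fed to the style-conditioned
# policy). Real entries would be synthesized offline by the LLM.
SKILL_DB = {
    "crawl":  {"embedding": [0.9, 0.1, 0.0], "style": [0.2, 0.8]},
    "trot":   {"embedding": [0.1, 0.9, 0.1], "style": [0.6, 0.4]},
    "sprint": {"embedding": [0.0, 0.2, 0.9], "style": [0.9, 0.1]},
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def advise_skill(scene_embedding):
    """Ground a scene/instruction embedding in the skill database and
    return the best-matching skill name plus its style vector."""
    name, entry = max(
        SKILL_DB.items(),
        key=lambda kv: cosine(scene_embedding, kv[1]["embedding"]),
    )
    return name, entry["style"]

# Example: a scene embedding whose closest database entry is "sprint".
name, style = advise_skill([0.05, 0.1, 0.95])
```

Because both the database and the query embeddings are computed without contacting an external service at run time, this retrieval can run fully on-board, consistent with the abstract's claim of avoiding online queries to cloud-based foundation models.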