This paper introduces CognitiveDog, a pioneering quadruped robot driven by a Large Multi-modal Model (LMM) that can not only communicate with humans verbally but also physically interact with its environment through object manipulation. The system was implemented on a Unitree Go1 robot dog equipped with a custom gripper and demonstrated autonomous decision-making, independently determining the most appropriate actions and object interactions to fulfill user-defined tasks. These tasks do not necessarily include direct instructions, so the robot must comprehend and execute them based on natural language input and environmental cues. The paper describes the system design, the characteristics of the dataset, and the software architecture. Key to this development is the robot's ability to navigate space using Visual-SLAM, manipulate and transport objects, and provide natural language commentary during task execution. Experimental results highlight the robot's task comprehension and adaptability, underscoring its potential in real-world applications. The dataset used to fine-tune the robot-dog behavior generation model is available at: huggingface.co/datasets/ArtemLykov/CognitiveDog_dataset