Human-robot collaboration, in which the robot intelligently assists the human with an upcoming task, is an appealing objective. To achieve this goal, the agent needs a fundamental collaborative navigation ability: it should reason about human intention by observing human activities and then navigate to the human's intended destination ahead of the human. However, this vital ability has not been well studied in the previous literature. To fill this gap, we propose a collaborative navigation (CoNav) benchmark. CoNav tackles the critical challenge of constructing a 3D navigation environment with realistic and diverse human activities. To achieve this, we design a novel LLM-based humanoid animation generation framework conditioned on both text descriptions and environmental context. The generated humanoid trajectories obey the environmental context and can be easily integrated into popular simulators. We empirically find that existing navigation methods struggle on the CoNav task because they neglect the perception of human intention. To address this problem, we propose an intention-aware agent that reasons about both long-term and short-term human intention. The agent predicts navigation actions based on the predicted intention and panoramic observations. The emergent agent behaviors, including observing humans, avoiding collisions with humans, and navigating, demonstrate the effectiveness of the proposed dataset and agent.
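To make the intention-aware design concrete, the sketch below illustrates one plausible reading of the pipeline: a long-term intention code is pooled from the history of observed human activity, a short-term code is taken from the most recent frame, and the two are fused with the panoramic observation to score discrete navigation actions. All names, dimensions, and the random weights are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HID, N_ACTIONS = 16, 32, 4  # illustrative sizes, not from the paper

# Randomly initialised matrices stand in for trained network parameters.
W_long = rng.standard_normal((HID, OBS_DIM)) * 0.1    # long-term intention head
W_short = rng.standard_normal((HID, OBS_DIM)) * 0.1   # short-term intention head
W_act = rng.standard_normal((N_ACTIONS, 2 * HID + OBS_DIM)) * 0.1  # policy head

def predict_intention(activity_history):
    """Predict two intention codes from observed human activity.

    Long-term intention (e.g. the intended destination) is read from the
    pooled history; short-term intention (e.g. the next movement) from the
    most recent frame.
    """
    pooled = activity_history.mean(axis=0)              # (OBS_DIM,)
    long_term = np.tanh(W_long @ pooled)                # (HID,)
    short_term = np.tanh(W_short @ activity_history[-1])  # (HID,)
    return long_term, short_term

def act(activity_history, panoramic_obs):
    """Score discrete navigation actions from intention + panorama."""
    long_term, short_term = predict_intention(activity_history)
    fused = np.concatenate([long_term, short_term, panoramic_obs])
    logits = W_act @ fused
    return int(np.argmax(logits))                       # chosen action index

history = rng.standard_normal((5, OBS_DIM))   # 5 past frames of human activity
panorama = rng.standard_normal(OBS_DIM)       # current panoramic feature
action = act(history, panorama)
```

The key design point the sketch captures is that the policy never sees the human activity directly; it acts only through the two predicted intention codes plus its own panoramic observation, which is what lets the agent head for the intended destination before the human arrives.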