A robot in a human-centric environment needs to account for the human's intent and future motion in its task and motion planning to ensure safe and effective operation. This requires symbolic reasoning about probable future actions and the ability to tie these actions to specific locations in the physical environment. While one can train behavioral models capable of predicting human motion from past activities, this approach requires large amounts of data to achieve acceptable long-horizon predictions. More importantly, the resulting models are constrained to specific data formats and modalities. Moreover, connecting predictions from such models to the environment at hand to ensure the applicability of these predictions is an unsolved problem. We present a system that utilizes a Large Language Model (LLM) to infer a human's next actions from a range of modalities without fine-tuning. A novel aspect of our system that is critical to robotics applications is that it links the predicted actions to specific locations in a semantic map of the environment. Our method leverages the fact that LLMs, trained on a vast corpus of text describing typical human behaviors, encode substantial world knowledge, including probable sequences of human actions and activities. We demonstrate how these localized activity predictions can be incorporated in a human-aware task planner for an assistive robot to reduce the occurrences of undesirable human-robot interactions by 29.2% on average.
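To make the idea of linking predicted actions to map locations concrete, the following is a minimal Python sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the `PredictedAction` record, the `SEMANTIC_MAP` contents and coordinates, the exact-label matching rule, and the `predicted_presence` score. The LLM query is replaced by a hard-coded list of predictions, so no real model or API is involved.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictedAction:
    """Hypothetical record for one predicted next action of the human."""
    description: str     # free-text action, as an LLM might phrase it
    target_object: str   # object the action is expected to involve
    probability: float   # assumed likelihood assigned to this action


# Hypothetical semantic map: object label -> (x, y) position in metres.
SEMANTIC_MAP = {
    "coffee machine": (4.0, 1.5),
    "sink": (5.0, 0.5),
    "sofa": (1.0, 3.0),
}


def ground_actions(actions, semantic_map):
    """Pair each action with a map location, skipping objects not in the map."""
    grounded = []
    for action in actions:
        location = semantic_map.get(action.target_object.lower())
        if location is not None:
            grounded.append((action, location))
    return grounded


def predicted_presence(grounded, point, radius):
    """Sum the probabilities of grounded actions located within radius of point."""
    return sum(
        action.probability
        for action, location in grounded
        if math.dist(location, point) <= radius
    )


# Stand-in for LLM output; a real system would obtain these from a model.
predictions = [
    PredictedAction("make coffee", "coffee machine", 0.6),
    PredictedAction("wash a mug", "sink", 0.3),
    PredictedAction("water the plant", "plant", 0.1),
]

grounded = ground_actions(predictions, SEMANTIC_MAP)
for action, location in grounded:
    print(f"{action.description} -> {location}")

# Score a candidate robot waypoint by how likely the human is to be nearby.
print(predicted_presence(grounded, point=(4.0, 1.0), radius=1.0))
```

A planner could use a score like `predicted_presence` to penalize waypoints the human is likely to occupy. Predictions whose object is missing from the map (here, the plant) are simply dropped in this sketch; a real system would need a more robust matching strategy.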