We present a novel approach for long-term human trajectory prediction, which is essential for long-horizon robot planning in human-populated environments. State-of-the-art human trajectory prediction methods are limited by their focus on collision avoidance and short-term planning, and their inability to model complex interactions of humans with the environment. In contrast, our approach overcomes these limitations by predicting sequences of human interactions with the environment and using this information to guide trajectory predictions over a horizon of up to 60s. We leverage Large Language Models (LLMs) to predict interactions with the environment by conditioning the LLM prediction on rich contextual information about the scene. This information is given as a 3D Dynamic Scene Graph that encodes the geometry, semantics, and traversability of the environment into a hierarchical representation. We then ground these interaction sequences into multi-modal spatio-temporal distributions over human positions using a probabilistic approach based on continuous-time Markov Chains. To evaluate our approach, we introduce a new semi-synthetic dataset of long-term human trajectories in complex indoor environments, which also includes annotations of human-object interactions. We show in thorough experimental evaluations that our approach achieves a 54% lower average negative log-likelihood (NLL) and a 26.5% lower Best-of-20 displacement error compared to the best non-privileged baselines for a time horizon of 60s.
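The grounding step described above — turning a predicted sequence of human-object interactions into a spatio-temporal distribution over positions — can be illustrated with a minimal continuous-time Markov chain sketch. Everything here is a hypothetical toy (the states, rates, jump probabilities, and positions are invented for illustration and are not the paper's actual model): dwell times are sampled from exponential distributions, transitions follow a jump matrix, and Monte-Carlo sampling yields a multi-modal distribution over positions at a query time within the 60 s horizon.

```python
import random

# Hypothetical interaction states and their locations in the scene (illustrative only).
POSITIONS = {"desk": (1.0, 2.0), "coffee_machine": (4.0, 0.5), "couch": (6.0, 3.0)}

# CTMC parameters: exit rate per state (1 / mean dwell time in seconds) and
# jump probabilities between interactions. Values are made up for this sketch.
RATES = {"desk": 1 / 30.0, "coffee_machine": 1 / 10.0, "couch": 1 / 20.0}
JUMP = {
    "desk": {"coffee_machine": 0.7, "couch": 0.3},
    "coffee_machine": {"desk": 0.6, "couch": 0.4},
    "couch": {"desk": 0.8, "coffee_machine": 0.2},
}

def sample_ctmc_path(start, horizon, rng):
    """Sample one (state, entry_time) sequence from the CTMC up to `horizon` seconds."""
    t, state = 0.0, start
    path = [(state, 0.0)]
    while True:
        t += rng.expovariate(RATES[state])  # exponential dwell time in current state
        if t >= horizon:
            return path
        state = rng.choices(list(JUMP[state]), weights=list(JUMP[state].values()))[0]
        path.append((state, t))

def position_distribution(start, query_t, horizon=60.0, n_samples=1000, seed=0):
    """Monte-Carlo estimate of the multi-modal position distribution at time `query_t`."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_samples):
        path = sample_ctmc_path(start, horizon, rng)
        # The state occupied at query_t is the last one entered before query_t.
        state = [s for s, t_enter in path if t_enter <= query_t][-1]
        counts[state] = counts.get(state, 0) + 1
    return {POSITIONS[s]: c / n_samples for s, c in counts.items()}
```

Querying `position_distribution("desk", 45.0)` returns a dictionary mapping candidate positions to probabilities, i.e. a discrete multi-modal distribution over where the person might be 45 s into the horizon; the actual approach additionally conditions the transition structure on the LLM-predicted interaction sequence and the 3D Dynamic Scene Graph.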