We present a novel approach for long-term human trajectory prediction in indoor human-centric environments, which is essential for long-horizon robot planning in these environments. State-of-the-art human trajectory prediction methods are limited by their focus on collision avoidance and short-term planning, and their inability to model complex interactions of humans with the environment. In contrast, our approach overcomes these limitations by predicting sequences of human interactions with the environment and using this information to guide trajectory predictions over a horizon of up to 60s. We leverage Large Language Models (LLMs) to predict interactions with the environment by conditioning the LLM prediction on rich contextual information about the scene. This information is given as a 3D Dynamic Scene Graph that encodes the geometry, semantics, and traversability of the environment into a hierarchical representation. We then ground these interaction sequences into multi-modal spatio-temporal distributions over human positions using a probabilistic approach based on continuous-time Markov Chains. To evaluate our approach, we introduce a new semi-synthetic dataset of long-term human trajectories in complex indoor environments, which also includes annotations of human-object interactions. We show in thorough experimental evaluations that our approach achieves a 54% lower average negative log-likelihood and a 26.5% lower Best-of-20 displacement error compared to the best non-privileged (i.e., evaluated in a zero-shot fashion on the dataset) baselines for a time horizon of 60s.
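The grounding step above, which turns predicted interaction sequences into multi-modal spatio-temporal distributions via continuous-time Markov chains, can be illustrated with a minimal sketch. Everything here is hypothetical: the three interaction targets, their 2D locations, the rate matrix `Q`, and the Gaussian mixture over positions are illustrative stand-ins, not the paper's learned model.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical 3-state CTMC over interaction targets (e.g., desk, kitchen,
# sofa); the locations and rates are illustrative, not learned values.
locations = np.array([[0.0, 0.0], [5.0, 2.0], [2.0, 6.0]])  # 2D positions (m)

# Generator (rate) matrix Q: off-diagonal entries are transition rates (1/s);
# each row sums to zero, as required for a CTMC generator.
Q = np.array([
    [-0.10,  0.07,  0.03],
    [ 0.02, -0.05,  0.03],
    [ 0.01,  0.04, -0.05],
])

def state_distribution(p0, t):
    """Distribution over interaction states after t seconds: p0 @ expm(Q*t)."""
    return p0 @ expm(Q * t)

def position_density(x, p0, t, sigma=0.8):
    """Multi-modal spatial density at point x and time t: a Gaussian mixture
    centred on the interaction locations, weighted by CTMC state probabilities."""
    w = state_distribution(p0, t)
    d2 = ((x - locations) ** 2).sum(axis=-1)
    return float((w * np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)).sum())

p0 = np.array([1.0, 0.0, 0.0])        # human currently at the first target
p60 = state_distribution(p0, 60.0)    # state distribution at the 60 s horizon
```

Because `expm(Q*t)` is a stochastic matrix for any valid generator `Q`, `p60` remains a proper probability distribution at every horizon, and the resulting position density is multi-modal whenever several interaction targets retain non-negligible probability mass.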