Virtual environments provide a rich, controlled setting for collecting detailed data on human behavior, offering unique opportunities for predicting human trajectories in dynamic scenes. However, most existing approaches have overlooked the potential of these environments, focusing instead on static contexts without considering user-specific factors. Using the CREATTIVE3D dataset, we model trajectories recorded in virtual reality (VR) scenes across diverse situations, including road-crossing tasks with user interactions and simulated visual impairments. We propose Diverse Context VR Human Motion Prediction (DiVR), a cross-modal transformer based on the Perceiver architecture that integrates both static and dynamic scene context through a heterogeneous graph convolution network. We conduct extensive experiments comparing DiVR against existing architectures, including MLPs, LSTMs, and transformers with gaze and point-cloud context, and we stress-test our model's generalizability across different users, tasks, and scenes. Results show that DiVR achieves higher accuracy and adaptability than competing models and than static graphs. This work highlights the advantages of VR datasets for context-aware human trajectory modeling, with potential applications in enhancing user experiences in the metaverse. Our source code is publicly available at https://gitlab.inria.fr/ffrancog/creattive3d-divr-model.
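To make the cross-modal integration concrete, the following is a minimal NumPy sketch of Perceiver-style cross-attention, in which a small latent array attends over a concatenation of trajectory tokens and scene-graph node embeddings. This is an illustrative toy, not the DiVR implementation: the dimensions, token counts, and random projection weights are hypothetical, and the real model additionally uses a heterogeneous graph convolution network to produce the scene embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, context, rng):
    """One Perceiver-style cross-attention step.

    latents: (L, d) learned latent array (the bottleneck).
    context: (M, d) input tokens; M can be much larger than L.
    """
    d = latents.shape[1]
    # Random projections stand in for learned weights (illustration only).
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    q, k, v = latents @ Wq, context @ Wk, context @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))  # (L, M): each latent attends to all tokens
    return attn @ v                        # (L, d): fixed-size summary of the context

rng = np.random.default_rng(0)
d = 8
latents = rng.standard_normal((4, d))        # 4 latents, independent of input size
traj_tokens = rng.standard_normal((10, d))   # hypothetical past-trajectory embeddings
scene_tokens = rng.standard_normal((20, d))  # hypothetical scene-graph node embeddings
context = np.concatenate([traj_tokens, scene_tokens], axis=0)  # cross-modal token set
out = cross_attention(latents, context, rng)
print(out.shape)  # latent summary: (4, 8)
```

The design point this illustrates is why a Perceiver backbone suits variable-size context: the attention cost scales with the number of context tokens times the fixed number of latents, so adding more scene-graph nodes or longer trajectory histories does not change the output size.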