Many existing motion prediction approaches rely on symbolic perception outputs, such as bounding boxes, road graph information, and traffic lights, to generate agent trajectories. This symbolic representation is a high-level abstraction of the real world. It can leave the motion prediction model vulnerable to perception errors (e.g., failures in detecting open-vocabulary obstacles) and can miss salient information from the scene context (e.g., poor road conditions). An alternative paradigm is end-to-end learning from raw sensors. However, this approach suffers from a lack of interpretability and requires significantly more training resources. In this work, we propose tokenizing the visual world into a compact set of scene elements and then leveraging pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner. The image foundation model enables our scene tokens to encode general knowledge of the open world, while the LiDAR neural network encodes geometry information. Our proposed representation can efficiently encode multi-frame, multi-modality observations with a few hundred tokens and is compatible with most transformer-based architectures. To evaluate our method, we have augmented the Waymo Open Motion Dataset with camera embeddings. Experiments on the Waymo Open Motion Dataset show that our approach leads to significant performance improvements over the state of the art.
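As an illustration of how a scene might be reduced to a compact token set, the following minimal sketch pools dense image features and LiDAR point features into one token per scene element. It is not the implementation described in this work. The function names (`masked_mean`, `build_scene_tokens`), the use of mean pooling, the concatenation of the two modalities, and the assumption that per-pixel and per-point element labels are already available are all hypothetical choices made for the example.

```python
import numpy as np


def masked_mean(features, labels, num_elements):
    """Average feature rows per element id; elements with no members get zeros."""
    dim = features.shape[-1]
    sums = np.zeros((num_elements, dim), dtype=np.float64)
    counts = np.zeros(num_elements, dtype=np.float64)
    # Unbuffered scatter-add so repeated labels accumulate correctly.
    np.add.at(sums, labels, features)
    np.add.at(counts, labels, 1.0)
    return sums / np.maximum(counts, 1.0)[:, None]


def build_scene_tokens(image_feats, pixel_labels, point_feats, point_labels, num_elements):
    """Return one token per scene element (hypothetical pooling scheme).

    image_feats:  (H, W, C_img) dense features, assumed to come from a frozen image encoder.
    pixel_labels: (H, W) integer element id for every pixel.
    point_feats:  (P, C_lidar) per-point features, assumed to come from a LiDAR network.
    point_labels: (P,) integer element id for every point.
    """
    h, w, c = image_feats.shape
    img_tokens = masked_mean(image_feats.reshape(h * w, c), pixel_labels.reshape(h * w), num_elements)
    lidar_tokens = masked_mean(point_feats, point_labels, num_elements)
    # Concatenate appearance and geometry features into a single token per element.
    return np.concatenate([img_tokens, lidar_tokens], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = build_scene_tokens(
        image_feats=rng.normal(size=(4, 6, 8)),
        pixel_labels=rng.integers(0, 3, size=(4, 6)),
        point_feats=rng.normal(size=(50, 4)),
        point_labels=rng.integers(0, 3, size=50),
        num_elements=3,
    )
    print(tokens.shape)  # (3, 12)
```

The resulting array has one row per scene element, so a transformer-based predictor could consume it as a short token sequence alongside its other inputs.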