Motion prediction is crucial for autonomous vehicles to operate safely in complex traffic environments, and extracting effective spatiotemporal relationships among traffic elements is key to accurate forecasting. Inspired by the success of pretrained large language models, this paper presents SEPT, a modeling framework that leverages self-supervised learning to develop a powerful spatiotemporal understanding of complex traffic scenes. Specifically, our approach involves three masking-reconstruction modeling tasks on scene inputs, including agents' trajectories and the road network, which pretrain the scene encoder to capture the kinematics within trajectories, the spatial structure of the road network, and the interactions among roads and agents. The pretrained encoder is then finetuned on the downstream forecasting task. Extensive experiments demonstrate that SEPT, without elaborate architectural design or manual feature engineering, achieves state-of-the-art performance on the Argoverse 1 and Argoverse 2 motion forecasting benchmarks, outperforming previous methods on all main metrics by a large margin.
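To make the masking-reconstruction idea concrete, the following is a minimal sketch of such an objective on a single agent trajectory. It is not SEPT's implementation: the function names, the zero-valued mask placeholder, the 50% mask ratio, and the toy constant-velocity trajectory are all assumptions chosen for illustration, and no encoder is trained here. The sketch only shows how time steps might be hidden and how a reconstruction loss can be restricted to the hidden steps.

```python
import numpy as np


def mask_trajectory(traj, mask_ratio, rng):
    """Hide a random subset of time steps by zeroing them out.

    traj: array of shape (T, 2) holding x/y positions.
    Returns the masked copy and a boolean array marking the hidden steps.
    """
    num_steps = traj.shape[0]
    num_masked = max(1, int(round(mask_ratio * num_steps)))
    hidden = np.zeros(num_steps, dtype=bool)
    hidden[rng.choice(num_steps, size=num_masked, replace=False)] = True
    masked = traj.copy()
    masked[hidden] = 0.0  # assumed stand-in for a learned mask token
    return masked, hidden


def masked_reconstruction_loss(pred, target, hidden):
    """Mean squared error computed only over the hidden time steps."""
    diff = pred[hidden] - target[hidden]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)

# Toy constant-velocity trajectory: 10 steps, one unit per step along x.
traj = np.stack([np.arange(10.0), np.zeros(10)], axis=1)
masked, hidden = mask_trajectory(traj, mask_ratio=0.5, rng=rng)

# A perfect reconstruction gives zero loss; echoing the masked input does not.
perfect_loss = masked_reconstruction_loss(traj, traj, hidden)
naive_loss = masked_reconstruction_loss(masked, traj, hidden)
```

In a pretraining setup of this kind, `pred` would come from a scene encoder followed by a reconstruction head, and analogous objectives could be defined over road-network elements; those components are outside the scope of this sketch.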