Motion prediction is crucial for autonomous vehicles to operate safely in complex traffic environments. Extracting effective spatiotemporal relationships among traffic elements is key to accurate forecasting. Inspired by the success of pretrained large language models, this paper presents SEPT, a modeling framework that leverages self-supervised learning to develop a powerful spatiotemporal understanding of complex traffic scenes. Specifically, our approach applies three masking-reconstruction modeling tasks to scene inputs, namely agents' trajectories and the road network, pretraining the scene encoder to capture kinematics within trajectories, the spatial structure of the road network, and interactions between roads and agents. The pretrained encoder is then fine-tuned on the downstream forecasting task. Extensive experiments demonstrate that SEPT, without elaborate architectural design or manual feature engineering, achieves state-of-the-art performance on the Argoverse 1 and Argoverse 2 motion forecasting benchmarks, outperforming previous methods on all main metrics by a large margin.
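As a rough illustration of the masking-reconstruction idea applied to a single trajectory, the sketch below hides random timesteps, fills them back in, and scores the result only on the hidden positions. It is a generic toy example, not the SEPT implementation: the function names, the zero placeholder for masked points, the 0.3 mask ratio, and the linear-interpolation stand-in for a learned encoder are all assumptions made for illustration.

```python
import numpy as np


def mask_trajectory(traj, mask_ratio, rng):
    """Hide a random subset of timesteps; return the masked copy and the boolean mask."""
    num_steps = traj.shape[0]
    num_masked = max(1, int(round(mask_ratio * num_steps)))
    masked_idx = rng.choice(num_steps, size=num_masked, replace=False)
    mask = np.zeros(num_steps, dtype=bool)
    mask[masked_idx] = True
    masked_traj = traj.copy()
    masked_traj[mask] = 0.0  # assumed placeholder standing in for a learned mask token
    return masked_traj, mask


def reconstruction_loss(pred, target, mask):
    """Mean squared error computed over the masked timesteps only."""
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))


def interpolate_baseline(masked_traj, mask):
    """Stand-in for a learned model: linearly interpolate hidden points from visible ones."""
    steps = np.arange(masked_traj.shape[0])
    visible = ~mask
    recon = masked_traj.copy()
    for dim in range(masked_traj.shape[1]):
        recon[mask, dim] = np.interp(
            steps[mask], steps[visible], masked_traj[visible, dim]
        )
    return recon


# Toy constant-velocity trajectory: 20 timesteps of (x, y) positions.
rng = np.random.default_rng(0)
t = np.arange(20, dtype=float)
traj = np.stack([2.0 * t, 0.5 * t], axis=1)

masked_traj, mask = mask_trajectory(traj, mask_ratio=0.3, rng=rng)
recon = interpolate_baseline(masked_traj, mask)
loss = reconstruction_loss(recon, traj, mask)
print(f"masked {mask.sum()} of {len(mask)} steps, reconstruction MSE = {loss:.4f}")
```

In a pretraining setup of this kind, the interpolation baseline would be replaced by the scene encoder plus a lightweight prediction head, and the loss on masked positions would drive the encoder's weights. Restricting the loss to masked timesteps keeps the model from being rewarded for simply copying visible inputs.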