This study explores self-supervised learning (SSL) for motion forecasting, an area that remains underexplored despite the widespread success of SSL in computer vision and natural language processing. To address this gap, we introduce Forecast-MAE, an extension of the masked autoencoder framework designed specifically for self-supervised learning on the motion forecasting task. Our approach includes a novel masking strategy that exploits the strong interconnections between agents' trajectories and road networks: it applies complementary masking to agents' future or history trajectories and random masking to lane segments. Experiments on the challenging Argoverse 2 motion forecasting benchmark show that Forecast-MAE, which uses standard Transformer blocks with minimal inductive bias, achieves performance competitive with state-of-the-art methods that rely on supervised learning and sophisticated designs. Moreover, it outperforms the previous self-supervised learning method by a significant margin. Code is available at https://github.com/jchengai/forecast-mae.
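The masking strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `complementary_agent_mask` and `random_lane_mask`, the 50% ratios, and the reading of "complementary" as "each agent has exactly one of its history or future masked" are assumptions made for illustration only.

```python
import random
from typing import List, Tuple


def complementary_agent_mask(
    num_agents: int, history_ratio: float, rng: random.Random
) -> Tuple[List[bool], List[bool]]:
    """Hypothetical helper: choose, per agent, which half of its trajectory to mask.

    Returns (mask_history, mask_future). Exactly one of the two flags is
    True for every agent, so the masks are complementary (assumed reading).
    """
    num_history_masked = round(num_agents * history_ratio)
    order = list(range(num_agents))
    rng.shuffle(order)
    history_ids = set(order[:num_history_masked])
    mask_history = [i in history_ids for i in range(num_agents)]
    mask_future = [not masked for masked in mask_history]
    return mask_history, mask_future


def random_lane_mask(
    num_lanes: int, mask_ratio: float, rng: random.Random
) -> List[bool]:
    """Hypothetical helper: mask a random subset of lane segments."""
    num_masked = round(num_lanes * mask_ratio)
    masked_ids = set(rng.sample(range(num_lanes), num_masked))
    return [i in masked_ids for i in range(num_lanes)]


# Example usage with assumed ratios (0.5 is illustrative, not from the source).
rng = random.Random(0)
mask_history, mask_future = complementary_agent_mask(8, 0.5, rng)
lane_mask = random_lane_mask(10, 0.5, rng)
```

In a masked-autoencoder setup, the unmasked tokens would be fed to the encoder and the masked ones reconstructed by a decoder; that part is omitted here because the abstract does not specify it.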