AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving

As an essential task in autonomous driving (AD), motion prediction aims to predict the future states of surrounding objects for navigation. One natural solution is to estimate the positions of other agents step by step, where each predicted time-step is conditioned on both the observed time-steps and the previously predicted time-steps, i.e., autoregressive prediction. Pioneering works like SocialLSTM and MFP design their decoders based on this intuition. However, almost all state-of-the-art works assume that the predicted time-steps are conditionally independent given the observed time-steps, and use a single linear layer to generate the positions of all time-steps simultaneously. They dominate most motion prediction leaderboards due to the simplicity of training MLPs compared to autoregressive networks. In this paper, we introduce GPT-style next token prediction into motion forecasting. In this way, the input and output can be represented in a unified space, which makes autoregressive prediction more feasible. However, unlike language data, which is composed of homogeneous units (words), the elements in a driving scene can have complex spatial-temporal and semantic relations. To this end, we propose to adopt three factorized attention modules with different neighbors for information aggregation, and different position encoding styles to capture their relations, e.g., encoding the transformation between coordinate systems for spatial relativity while adopting RoPE for temporal relativity. Empirically, equipped with these tailored designs, the proposed method achieves state-of-the-art performance on the Waymo Open Motion and Waymo Interaction datasets. Notably, AMP outperforms other recent autoregressive motion prediction methods, MotionLM and StateTransformer, which demonstrates the effectiveness of the proposed designs.
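To illustrate why RoPE suits temporal relativity, the following is a minimal sketch of standard rotary position encoding in plain Python. It is not AMP's implementation: the function names `rope` and `attention_score`, the base of 10000, and the toy vectors are assumptions made for illustration. The sketch shows the property that matters here: the attention score between a rotated query and key depends only on the offset between their time-steps, not on the absolute time-steps.

```python
import math


def rope(vec, pos, base=10000.0):
    """Apply rotary position encoding to `vec` at integer position `pos`.

    Each consecutive pair (vec[i], vec[i+1]) is rotated by the angle
    pos * base**(-i / d), the usual RoPE frequency schedule.
    (Hypothetical helper, not taken from the paper.)
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def attention_score(q, k, q_pos, k_pos):
    """Dot product of the rotated query and key (one unnormalized logit)."""
    return sum(a * b for a, b in zip(rope(q, q_pos), rope(k, k_pos)))


# Toy query/key vectors (assumed values for illustration only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same temporal offset (3 steps) at two different absolute times.
early = attention_score(q, k, q_pos=5, k_pos=2)
late = attention_score(q, k, q_pos=12, k_pos=9)
print(abs(early - late) < 1e-9)
```

Because the score is invariant to shifting both positions by the same amount, a token at a newly predicted time-step relates to its history in the same way regardless of how far the rollout has progressed, which is one plausible reason such an encoding fits step-by-step prediction.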
