ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling

Anticipating the multimodality of future events lays the foundation for safe autonomous driving. However, multimodal motion prediction for traffic agents has been clouded by the lack of multimodal ground truth. Existing works predominantly adopt the winner-take-all training strategy to tackle this challenge, yet still suffer from limited trajectory diversity and uncalibrated mode confidence. While some approaches address these limitations by generating excessive trajectory candidates, they necessitate a post-processing stage to identify the most representative modes, a process lacking universal principles and compromising trajectory accuracy. We are thus motivated to introduce ModeSeq, a new multimodal prediction paradigm that models modes as sequences. Unlike the common practice of decoding multiple plausible trajectories in one shot, ModeSeq requires motion decoders to infer the next mode step by step, thereby more explicitly capturing the correlation between modes and significantly enhancing the ability to reason about multimodality. Leveraging the inductive bias of sequential mode prediction, we also propose the Early-Match-Take-All (EMTA) training strategy to diversify the trajectories further. Without relying on dense mode prediction or heuristic post-processing, ModeSeq considerably improves the diversity of multimodal output while attaining satisfactory trajectory accuracy, resulting in balanced performance on motion prediction benchmarks. Moreover, ModeSeq naturally emerges with the capability of mode extrapolation, which supports forecasting more behavior modes when the future is highly uncertain.
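
To make the contrast between the two training strategies concrete, here is a minimal sketch of how the positive (supervised) mode might be selected under each. The winner-take-all rule is the standard one: only the mode closest to the ground truth is optimized. The abstract does not define EMTA, so the rule shown is an assumption made for illustration: the earliest mode in the decoded sequence that matches the ground truth is selected, with a fallback to winner-take-all when nothing matches. The function names, the final-displacement error, and the `match_threshold` value are all hypothetical.

```python
import math


def final_displacement(traj, gt):
    """Euclidean distance between the final points of two trajectories."""
    (x1, y1), (x2, y2) = traj[-1], gt[-1]
    return math.hypot(x1 - x2, y1 - y2)


def winner_take_all(modes, gt):
    """Standard WTA selection: index of the mode closest to the ground truth."""
    errors = [final_displacement(m, gt) for m in modes]
    return min(range(len(modes)), key=errors.__getitem__)


def early_match_take_all(modes, gt, match_threshold=2.0):
    """Assumed EMTA-style selection (not taken from the paper).

    `modes` is ordered by decoding step. The first mode whose error is
    within `match_threshold` is selected; if none matches, fall back to WTA.
    """
    for step, mode in enumerate(modes):
        if final_displacement(mode, gt) <= match_threshold:
            return step
    return winner_take_all(modes, gt)


# Toy example: one ground-truth trajectory and three sequentially decoded modes.
gt = [(0.0, 0.0), (10.0, 0.0)]
modes = [
    [(0.0, 0.0), (9.0, 1.0)],   # close enough to count as a match
    [(0.0, 0.0), (10.0, 0.2)],  # closest overall
    [(0.0, 0.0), (5.0, 5.0)],   # a different behavior
]

wta_index = winner_take_all(modes, gt)
emta_index = early_match_take_all(modes, gt)
print(wta_index, emta_index)
```

Under these assumptions, winner-take-all supervises the second mode because it is closest, while the early-match rule supervises the first mode because it already matches. This pushes the matching trajectory toward earlier steps of the sequence and leaves later steps free to cover other behaviors.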
