Joint Perception and Prediction for Autonomous Driving: A Survey

From arXiv: 24 pages, 5 sections, 7 figures, 7 tables. This work has been submitted to the IEEE Transactions on Intelligent Transportation Systems for possible publication.

Perception and prediction modules are critical components of autonomous driving systems, enabling vehicles to navigate safely through complex environments. The perception module interprets the environment, including its static and dynamic objects, while the prediction module forecasts the future behavior of those objects. Together, these modules are typically divided into three tasks: object detection, object tracking, and motion prediction. Traditionally, these tasks have been developed and optimized independently, with outputs passed sequentially from one to the next. However, this approach has significant limitations: computational resources are not shared across tasks, the lack of joint optimization can amplify errors as they propagate through the pipeline, and uncertainty is rarely passed between modules, resulting in significant information loss. To address these challenges, the joint perception and prediction paradigm has emerged, integrating perception and prediction into a unified model through multi-task learning. This strategy not only overcomes the limitations of previous methods, but also gives all three tasks direct access to raw sensor data, allowing richer and more nuanced interpretations of the environment. This paper presents the first comprehensive survey of joint perception and prediction for autonomous driving. We propose a taxonomy that categorizes approaches by input representation, scene context modeling, and output representation, highlighting their contributions and limitations. Additionally, we present a qualitative analysis and a quantitative comparison of existing methods. Finally, we discuss future research directions based on the gaps identified in the state of the art.

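
To make the contrast with a sequential pipeline concrete, the following is a minimal sketch of the joint idea described in the abstract: one shared encoder reads the raw sensor input once, three task heads reuse its features, and a single weighted loss couples the tasks. Everything here is an assumption for illustration only. The names `JointModel` and `joint_loss`, the scalar head outputs, the squared-error loss, and the task weights are hypothetical and are not taken from the survey or from any surveyed method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = List[float]


@dataclass
class JointModel:
    """Toy joint model (hypothetical): one shared encoder, three task heads."""
    encoder: Callable[[List[float]], Features]
    heads: Dict[str, Callable[[Features], float]]
    encoder_calls: int = 0

    def forward(self, raw_sensor: List[float]) -> Dict[str, float]:
        # The shared features are computed once and reused by every head,
        # instead of each task running its own feature extractor.
        self.encoder_calls += 1
        feats = self.encoder(raw_sensor)
        return {task: head(feats) for task, head in self.heads.items()}


def joint_loss(outputs: Dict[str, float],
               targets: Dict[str, float],
               weights: Dict[str, float]) -> float:
    # Multi-task objective (assumed form): weighted sum of per-task squared errors.
    return sum(weights[t] * (outputs[t] - targets[t]) ** 2 for t in outputs)


# Stand-in encoder and heads; real systems would use learned networks.
model = JointModel(
    encoder=lambda x: [sum(x), max(x)],
    heads={
        "detection": lambda f: f[0],
        "tracking": lambda f: f[1],
        "prediction": lambda f: f[0] + f[1],
    },
)

outputs = model.forward([1.0, 2.0, 3.0])
loss = joint_loss(
    outputs,
    targets={"detection": 5.0, "tracking": 3.0, "prediction": 7.0},
    weights={"detection": 1.0, "tracking": 1.0, "prediction": 0.5},
)
print(outputs, loss, model.encoder_calls)
```

In this toy setup the encoder runs once per input regardless of the number of heads, and the single loss value depends on all three outputs, which is the property that lets a joint model be optimized end to end rather than task by task.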