Motion forecasting is crucial for enabling autonomous vehicles to anticipate the future trajectories of surrounding agents. Doing so requires solving mapping, detection, tracking, and then forecasting problems in a multi-step pipeline. Within this complex system, advances in conventional forecasting methods have been made on curated data, i.e., under the assumption of perfect maps, detection, and tracking. This paradigm, however, ignores any errors from upstream modules. Meanwhile, an emerging end-to-end paradigm, which tightly integrates the perception and forecasting architectures into joint training, promises to solve this issue. However, the evaluation protocols of the two approaches have so far been incompatible, making a direct comparison impossible. In fact, conventional forecasting methods are usually neither trained nor tested in real-world pipelines (e.g., with upstream detection, tracking, and mapping modules). In this work, we aim to bring forecasting models closer to real-world deployment. First, we propose a unified evaluation pipeline for forecasting methods with real-world perception inputs, allowing us to compare conventional and end-to-end methods for the first time. Second, our in-depth study uncovers a substantial performance gap when transitioning from curated to perception-based data. In particular, we show that this gap (1) stems not only from differences in precision but also from the nature of the imperfect inputs provided by perception modules, and (2) is not trivially reduced by simply finetuning on perception outputs. Based on extensive experiments, we provide recommendations for critical areas that require improvement and guidance towards more robust motion forecasting in the real world. Our evaluation library for benchmarking models under standardized and practical conditions is available at \url{https://github.com/valeoai/MFEval}.
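The performance gap described above is typically quantified with multi-modal displacement metrics such as minADE and minFDE (minimum average/final displacement error over the predicted modes). A minimal sketch of these standard metrics, assuming NumPy arrays; the function name and shapes are illustrative and not taken from the MFEval library:

```python
import numpy as np

def min_ade_fde(pred, gt):
    """Compute minADE and minFDE for a multi-modal prediction.

    pred: (K, T, 2) array of K candidate future trajectories over T steps.
    gt:   (T, 2) ground-truth trajectory.
    """
    # Per-mode Euclidean displacement at each timestep: shape (K, T)
    dists = np.linalg.norm(pred - gt[None], axis=-1)
    ade = dists.mean(axis=1)      # average displacement per mode
    fde = dists[:, -1]            # final-step displacement per mode
    # The best-matching mode defines the score (the "min" in minADE/minFDE)
    return ade.min(), fde.min()

# Toy usage: a straight-line ground truth with one close and one drifting mode.
gt = np.stack([np.arange(5, dtype=float), np.zeros(5)], axis=-1)
good = gt[None] + 0.1   # mode staying near the ground truth
bad = gt[None] * 2.0    # mode drifting away over time
min_ade, min_fde = min_ade_fde(np.concatenate([good, bad]), gt)
```

Under curated inputs, the history fed to `pred` comes from ground-truth tracks; under perception-based inputs, it comes from upstream detection and tracking, which is exactly where the paper's measured gap appears.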