Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive?

Motion forecasting is crucial in enabling autonomous vehicles to anticipate the future trajectories of surrounding agents. In practice, it sits at the end of a multi-step pipeline that must first solve mapping, detection, and tracking. Within this complex system, advances in conventional forecasting methods have been made using curated data, i.e., under the assumption of perfect maps, detection, and tracking. This paradigm, however, ignores any errors from upstream modules. Meanwhile, an emerging end-to-end paradigm that tightly integrates the perception and forecasting architectures through joint training promises to solve this issue. So far, however, the evaluation protocols of the two paradigms have been incompatible, making a direct comparison impossible. In fact, and perhaps surprisingly, conventional forecasting methods are usually neither trained nor tested in real-world pipelines (e.g., with upstream detection, tracking, and mapping modules). In this work, we aim to bring forecasting models closer to real-world deployment. First, we propose a unified evaluation pipeline for forecasting methods with real-world perception inputs, allowing us to compare the performance of conventional and end-to-end methods for the first time. Second, our in-depth study uncovers a substantial performance gap when transitioning from curated to perception-based data. In particular, we show that this gap (1) stems not only from differences in precision but also from the nature of the imperfect inputs provided by perception modules, and (2) is not trivially reduced by simply finetuning on perception outputs. Based on extensive experiments, we provide recommendations for critical areas that require improvement and guidance towards more robust motion forecasting in the real world. We will release an evaluation library to benchmark models under standardized and practical conditions.
