Classical autonomous driving systems connect perception and prediction modules through hand-crafted bounding-box interfaces, which limit information flow and propagate errors to downstream tasks. Recent research develops end-to-end models that jointly address perception and prediction; however, these models often fail to fully exploit the synergy between appearance and motion cues, relying mainly on short-term visual features. Following the idea of "looking backward to look forward", we propose MASAR, a novel, fully differentiable framework for joint 3D detection and trajectory forecasting that is compatible with any transformer-based 3D detector. MASAR employs an object-centric spatio-temporal mechanism that jointly encodes appearance and motion features. By predicting past trajectories and refining them under guidance from appearance cues, MASAR captures long-term temporal dependencies that improve future trajectory forecasting. Experiments on the nuScenes dataset demonstrate MASAR's effectiveness, showing improvements of over 20% in minADE and minFDE while maintaining robust detection performance. Code and models are available at https://github.com/aminmed/MASAR.
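The "looking backward to look forward" pipeline can be illustrated with a minimal NumPy sketch: per-object appearance and motion embeddings are jointly encoded, a past trajectory is first predicted from motion cues, refined by a residual correction driven by appearance cues, and the refined history then conditions the future forecast. All variable names, dimensions, and the linear-layer stand-ins are illustrative assumptions, not MASAR's actual architecture.

```python
import numpy as np

# Hypothetical sketch of the backward-then-forward idea; shapes and
# projections are placeholders for the paper's learned modules.
rng = np.random.default_rng(0)

N, D = 4, 16           # number of objects, feature dimension (assumed)
T_past, T_fut = 4, 6   # past / future horizons, 2D positions (assumed)

appearance = rng.normal(size=(N, D))  # per-object appearance embedding
motion = rng.normal(size=(N, D))      # per-object motion embedding

# Joint object-centric spatio-temporal encoding: here a simple
# concatenation followed by a linear projection and nonlinearity.
W_enc = rng.normal(size=(2 * D, D)) / np.sqrt(2 * D)
joint = np.tanh(np.concatenate([appearance, motion], axis=1) @ W_enc)

# Step 1 (look backward): predict the past trajectory from motion cues.
W_past = rng.normal(size=(D, T_past * 2)) / np.sqrt(D)
past = (motion @ W_past).reshape(N, T_past, 2)

# Step 2: refine the past trajectory with appearance guidance,
# modeled as a small residual correction.
W_ref = rng.normal(size=(D, T_past * 2)) / np.sqrt(D)
past_refined = past + 0.1 * (appearance @ W_ref).reshape(N, T_past, 2)

# Step 3 (look forward): forecast the future from the joint encoding
# plus the refined history, capturing longer-term dependencies.
hist = past_refined.reshape(N, -1)
W_fut = rng.normal(size=(D + T_past * 2, T_fut * 2)) / np.sqrt(D + T_past * 2)
future = (np.concatenate([joint, hist], axis=1) @ W_fut).reshape(N, T_fut, 2)

print(future.shape)  # one forecast of T_fut 2D waypoints per object
```

In the actual framework these projections would be transformer blocks operating on detector queries; the sketch only shows how the refined past trajectory feeds back into the future prediction.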