Spatio-temporal alignment is crucial for temporal modeling of end-to-end (E2E) perception in autonomous driving (AD), providing valuable structural and textural prior information. Existing methods typically rely on attention mechanisms to align objects across frames, simplifying motion with a single, unified explicit physical model (e.g., constant velocity). These approaches favor semantic features for implicit alignment, challenging the importance of explicit motion modeling in the traditional perception paradigm. However, variations in motion states and object features across categories and frames render this alignment suboptimal. To address this, we propose HAT, a spatio-temporal alignment module that allows each object to adaptively decode the optimal alignment proposal from multiple hypotheses without direct supervision. Specifically, HAT first utilizes multiple explicit motion models to generate spatial anchors and motion-aware feature proposals for historical instances. It then performs multi-hypothesis decoding by incorporating the semantic and motion cues embedded in cached object queries, ultimately providing the optimal alignment proposal for the target frame. On nuScenes, HAT consistently improves 3D temporal detectors and trackers across diverse baselines. It achieves state-of-the-art tracking results with 46.0% AMOTA on the test set when paired with the DETR3D detector. In an object-centric E2E AD method, HAT enhances perception accuracy (+1.3% mAP, +3.1% AMOTA) and reduces the collision rate by 32%. When semantics are corrupted (nuScenes-C), HAT's strengthened motion modeling enables more robust perception and planning in E2E AD.
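The multi-hypothesis idea above can be illustrated with a minimal sketch. This is not the paper's implementation: the motion models, feature dimensions, and dot-product scoring below are simplifying assumptions chosen only to show how several explicit motion hypotheses can be propagated and then softly decoded against a cached object query.

```python
import numpy as np

np.random.seed(0)  # for a reproducible toy example

def propagate(state, dt):
    """Three hypothetical explicit motion models (static, constant
    velocity, damped velocity) applied to a state [x, y, vx, vy].
    Returns a (K, 2) array of candidate anchor positions."""
    x, y, vx, vy = state
    return np.array([
        [x, y],                                   # static hypothesis
        [x + vx * dt, y + vy * dt],               # constant-velocity hypothesis
        [x + 0.5 * vx * dt, y + 0.5 * vy * dt],   # damped-velocity hypothesis
    ])

def align(state, proposal_feats, cached_query, dt=0.5):
    """Soft multi-hypothesis decoding: score each motion-aware feature
    proposal against the cached object query (stand-in: dot product),
    then blend the spatial anchors with softmax weights."""
    anchors = propagate(state, dt)              # (K, 2) spatial anchors
    scores = proposal_feats @ cached_query      # (K,) per-hypothesis scores
    w = np.exp(scores - scores.max())
    w /= w.sum()                                # softmax weights over hypotheses
    return w @ anchors                          # (2,) fused alignment proposal

# Toy historical instance: position (10, 5), velocity (2, 0).
state = np.array([10.0, 5.0, 2.0, 0.0])
proposal_feats = np.random.randn(3, 8)  # hypothetical motion-aware features
cached_query = np.random.randn(8)       # hypothetical cached object query
fused = align(state, proposal_feats, cached_query)
```

In the actual module the scoring is learned end-to-end rather than a raw dot product, and the decoding is trained without direct supervision on which hypothesis is correct; the sketch only conveys the "propagate under several models, then adaptively weight" structure.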