End-to-end perception and trajectory prediction from raw sensor data is a key capability for autonomous driving. Modular pipelines restrict information flow and can amplify upstream errors. Recent query-based, fully differentiable perception-and-prediction (PnP) models mitigate these issues, yet the complementarity of cameras and LiDAR in query space has not been sufficiently explored. Existing models often rely on fusion schemes whose heuristic alignment and discrete selection steps prevent full use of the available information and can introduce unwanted bias. We propose Li-ViP3D++, a query-based multimodal PnP framework that introduces Query-Gated Deformable Fusion (QGDF) to integrate multi-view RGB and LiDAR in query space. QGDF (i) aggregates image evidence via masked attention across cameras and feature levels, (ii) extracts LiDAR context through fully differentiable BEV sampling with learned per-query offsets, and (iii) applies query-conditioned gating to adaptively weight visual and geometric cues per agent. The resulting architecture jointly optimizes detection, tracking, and multi-hypothesis trajectory forecasting in a single end-to-end model. On nuScenes, Li-ViP3D++ improves end-to-end behavior and detection quality, achieving higher EPA (0.335) and mAP (0.502) while substantially reducing false positives (FP ratio 0.147), and it is faster than the prior Li-ViP3D variant (139.82 ms vs. 145.91 ms). These results indicate that query-space, fully differentiable camera-LiDAR fusion can increase the robustness of end-to-end PnP without sacrificing deployability.
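To make step (iii) concrete, the query-conditioned gating can be sketched as a gate computed from the query embedding that blends camera and LiDAR evidence per channel. The following is a minimal pure-Python illustration, not the authors' implementation; the linear-plus-sigmoid gate parameterization and all names (`query_gated_fusion`, `w`, `b`) are assumptions for exposition.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def query_gated_fusion(query, cam_feat, lidar_feat, w, b):
    """Hypothetical sketch of query-conditioned gating (QGDF step iii).

    query      -- per-agent query embedding (list of floats)
    cam_feat   -- aggregated image evidence for this query (step i)
    lidar_feat -- BEV-sampled LiDAR context for this query (step ii)
    w, b       -- assumed linear gate parameters (matrix rows, biases)
    """
    # Gate in (0, 1) per channel, predicted from the query itself.
    gate = [sigmoid(sum(wi * qi for wi, qi in zip(row, query)) + bi)
            for row, bi in zip(w, b)]
    # Convex per-channel blend of visual and geometric cues.
    return [g * c + (1.0 - g) * l
            for g, c, l in zip(gate, cam_feat, lidar_feat)]

# With zero gate weights the sigmoid gives 0.5, so the fusion
# reduces to the channel-wise average of the two modalities.
fused = query_gated_fusion([1.0, 2.0], [2.0, 0.0], [0.0, 2.0],
                           [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
print(fused)  # [1.0, 1.0]
```

In a trained model the gate parameters are learned end-to-end, so each agent query can lean toward image evidence (e.g. distant, sparse-LiDAR objects) or geometric evidence (e.g. poorly lit scenes) as appropriate.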