Real-time perception, also called streaming perception, is a crucial aspect of autonomous driving that remains underexplored in existing research. To address this gap, we present DAMO-StreamNet, an optimized framework that combines recent advances from the YOLO series with a comprehensive analysis of spatial and temporal perception mechanisms. The key innovations of DAMO-StreamNet are: (1) a robust neck structure incorporating deformable convolution, which enlarges the receptive field and improves feature alignment; (2) a dual-branch structure that integrates short-path semantic features and long-path temporal features, improving the accuracy of motion state prediction; (3) logits-level distillation for efficient optimization, which aligns the logits of the teacher and student networks in semantic space; and (4) a real-time forecasting mechanism that updates support frame features with the current frame, ensuring seamless streaming perception during inference. Our experiments demonstrate that DAMO-StreamNet surpasses existing state-of-the-art methods, achieving 37.8% sAP at the normal input size (600, 960) and 43.3% sAP at the large input size (1200, 1920) without using extra data. This work not only sets a new benchmark for real-time perception but also provides valuable insights for future research. Additionally, DAMO-StreamNet can be applied to other autonomous systems, such as drones and robots, that also depend on real-time perception. The code is available at https://github.com/zhiqic/DAMO-StreamNet.
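As a rough illustration of the real-time forecasting mechanism in item (4), the following Python sketch caches the features of recent frames so that each incoming frame is encoded only once, and the current frame's features then become support features for the next step. This is a minimal sketch under stated assumptions, not the released implementation: the names `SupportFeatureBuffer`, `extract`, `num_support`, and `step` are hypothetical, and reusing the current features as support on the very first frame is an assumed fallback.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Feature = List[float]


class SupportFeatureBuffer:
    """Hypothetical cache of support-frame features for streaming inference."""

    def __init__(
        self,
        extract: Callable[[Sequence[float]], Feature],
        num_support: int = 1,
    ) -> None:
        self.extract = extract  # stand-in for the backbone and neck
        # A bounded deque drops the oldest features automatically.
        self.buffer: Deque[Feature] = deque(maxlen=num_support)
        self.calls = 0  # counts extractor invocations, for illustration only

    def step(self, frame: Sequence[float]) -> Tuple[Feature, List[Feature]]:
        # Encode only the current frame; older frames come from the cache.
        current = self.extract(frame)
        self.calls += 1
        # Assumed fallback: with no history yet, the current features
        # double as the support features.
        support = list(self.buffer) if self.buffer else [current]
        # The current frame's features become support for the next step.
        self.buffer.append(current)
        return current, support


# Toy extractor standing in for a real network.
buf = SupportFeatureBuffer(lambda f: [2.0 * x for x in f], num_support=2)
for frame in ([1.0], [2.0], [3.0], [4.0]):
    current, support = buf.step(frame)
```

With this design the per-frame cost stays at one extractor call regardless of how many support frames are used, which is the property a streaming setting needs.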
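The logits-level distillation in item (3) can be sketched as a loss that pulls the student's logits toward the teacher's. The abstract states only that the logits are aligned, so the use of mean squared error here is an assumption, and the function name `logits_mse_loss` is hypothetical. In a real training setup the teacher logits would typically be treated as constants.

```python
from typing import Sequence


def logits_mse_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
) -> float:
    """Mean squared error between student and teacher logits (assumed loss form)."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher logits must have the same length")
    if not student_logits:
        raise ValueError("logits must be non-empty")
    squared_errors = (
        (s - t) ** 2 for s, t in zip(student_logits, teacher_logits)
    )
    return sum(squared_errors) / len(student_logits)


# The loss is zero when the student matches the teacher exactly
# and grows as the logits diverge.
loss = logits_mse_loss([1.0, 2.0], [1.0, 4.0])
```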