Visual-based perception is a key module for autonomous driving. Among visual perception tasks, video object detection is a fundamental yet challenging one because of feature degradation caused by fast motion or diverse object poses. Current models usually aggregate features from neighboring frames to enhance object representations so that the task heads can generate more accurate predictions. Although these methods achieve better performance, they rely on information from future frames and suffer from high computational complexity. Moreover, the aggregation process is not reconfigurable at inference time. These issues make most existing models infeasible for online applications. To address these problems, we introduce a stepwise spatial global-local aggregation network. Our proposed model consists of three main parts: 1) a multi-stage stepwise network that gradually refines the predictions and object representations from the previous stage; 2) spatial global-local aggregation, which fuses local information from neighboring frames with global semantics from the current frame to mitigate feature degradation; 3) a dynamic aggregation strategy that stops the aggregation process early based on the refinement results to remove redundancy and improve efficiency. Extensive experiments on the ImageNet VID benchmark validate the effectiveness and efficiency of our proposed models.
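To make the three components concrete, below is a minimal, hypothetical PyTorch sketch of the stepwise aggregation loop with a dynamic early-exit gate. All names (`StepwiseAggregator`, `exit_threshold`, the attention-based fusion, the linear gate) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class StepwiseAggregator(nn.Module):
    """Hypothetical sketch: multi-stage refinement in which each stage fuses
    local features from neighboring (past) frames with global semantics of
    the current frame, and a dynamic gate may stop aggregation early."""

    def __init__(self, dim=256, num_stages=3, exit_threshold=0.9):
        super().__init__()
        # One cross-attention block per stage: queries come from the current
        # frame (global semantics); keys/values from neighboring frames (local cues).
        self.stages = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
             for _ in range(num_stages)]
        )
        # Per-stage scorer estimating how refined the representation is;
        # used by the dynamic aggregation strategy to decide early exit.
        self.gates = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_stages)])
        self.exit_threshold = exit_threshold

    def forward(self, cur_feats, neigh_feats):
        # cur_feats:   (B, N, dim) object representations from the current frame
        # neigh_feats: (B, M, dim) features pooled from neighboring past frames
        refined = cur_feats
        for attn, gate in zip(self.stages, self.gates):
            fused, _ = attn(refined, neigh_feats, neigh_feats)
            refined = refined + fused  # residual refinement of the previous stage
            score = torch.sigmoid(gate(refined)).mean()
            if score > self.exit_threshold:  # stop aggregating once refined enough
                break
        return refined

if __name__ == "__main__":
    model = StepwiseAggregator()
    cur = torch.randn(2, 100, 256)    # 100 object queries, current frame
    neigh = torch.randn(2, 300, 256)  # features from, e.g., 3 past frames
    print(model(cur, neigh).shape)    # torch.Size([2, 100, 256])
```

Note that the sketch only attends over past frames, matching the abstract's online setting; the per-stage gate is one plausible way to realize the reconfigurable, early-stopping aggregation the abstract describes.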