Visual-based perception is a key module for autonomous driving. Among visual perception tasks, video object detection is a fundamental yet challenging one because of feature degradation caused by fast motion or diverse object poses. Current models usually aggregate features from neighboring frames to enhance object representations so that the task heads can generate more accurate predictions. Although these methods achieve better performance, they rely on information from future frames and suffer from high computational complexity. Moreover, the aggregation process is not reconfigurable at inference time. These issues make most existing models infeasible for online applications. To address these problems, we introduce a stepwise spatial global-local aggregation network. Our proposed model consists of three main parts: 1) a multi-stage stepwise network that gradually refines the predictions and object representations from the previous stage; 2) spatial global-local aggregation, which fuses local information from neighboring frames with global semantics from the current frame to mitigate feature degradation; 3) a dynamic aggregation strategy that stops the aggregation process early based on the refinement results to remove redundancy and improve efficiency. Extensive experiments on the ImageNet VID benchmark validate the effectiveness and efficiency of our proposed models.
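To make the three components concrete, below is a minimal, hypothetical PyTorch sketch of the stepwise aggregation loop with a dynamic early-exit gate. All names (`StepwiseAggregator`, `exit_threshold`, the attention-based fusion, the linear gate) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class StepwiseAggregator(nn.Module):
    """Hypothetical sketch: multi-stage refinement in which each stage fuses
    local features from neighboring (past) frames with global semantics of
    the current frame, and a dynamic gate may stop aggregation early."""

    def __init__(self, dim=256, num_stages=3, exit_threshold=0.9):
        super().__init__()
        # One cross-attention block per stage: queries come from the current
        # frame (global semantics); keys/values from neighboring frames (local cues).
        self.stages = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
             for _ in range(num_stages)]
        )
        # Per-stage scorer estimating how refined the representation is;
        # used by the dynamic aggregation strategy to decide early exit.
        self.gates = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_stages)])
        self.exit_threshold = exit_threshold

    def forward(self, cur_feats, neigh_feats):
        # cur_feats:   (B, N, dim) object representations from the current frame
        # neigh_feats: (B, M, dim) features pooled from neighboring past frames
        refined = cur_feats
        for attn, gate in zip(self.stages, self.gates):
            fused, _ = attn(refined, neigh_feats, neigh_feats)
            refined = refined + fused  # residual refinement of the previous stage
            score = torch.sigmoid(gate(refined)).mean()
            if score > self.exit_threshold:  # stop aggregating once refined enough
                break
        return refined

if __name__ == "__main__":
    model = StepwiseAggregator()
    cur = torch.randn(2, 100, 256)    # 100 object queries, current frame
    neigh = torch.randn(2, 300, 256)  # features from, e.g., 3 past frames
    print(model(cur, neigh).shape)    # torch.Size([2, 100, 256])
```

Note that the sketch only attends over past frames, matching the abstract's online setting; the per-stage gate is one plausible way to realize the reconfigurable, early-stopping aggregation the abstract describes.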