State-of-the-art video object detection methods maintain a memory structure, either a sliding window or a memory queue, and use attention mechanisms to enhance the current frame with it. However, we argue that these memory structures are neither efficient nor sufficient because of two implied operations: (1) concatenating all features in memory for enhancement, which leads to a heavy computational cost; (2) frame-wise memory updating, which prevents the memory from capturing more temporal information. In this paper, we propose MAMBA, a multi-level aggregation architecture built on a memory bank. Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods: (1) light-weight key-set construction, which significantly reduces the computational cost; (2) a fine-grained feature-wise updating strategy, which enables our method to utilize knowledge from the whole video. To better enhance features from complementary levels, i.e., feature maps and proposals, we further propose a generalized enhancement operation (GEO) that aggregates multi-level features in a unified manner. We conduct extensive evaluations on the challenging ImageNetVID dataset. Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy. More remarkably, MAMBA achieves 83.7%/84.6% mAP at 12.6/9.1 FPS with ResNet-101. Code is available at https://github.com/guanxiongsun/vfe.pytorch.
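The two memory-bank operations can be illustrated with a minimal sketch. It is not the released implementation: the class `MemoryBank`, the functions `enhance` and `run_video`, all parameter names, and the use of uniform random sampling for key-set construction and eviction are assumptions made for illustration, and plain Python lists stand in for feature tensors.

```python
import math
import random


class MemoryBank:
    """Toy feature-level memory bank (hypothetical API, not the authors' code)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity        # assumed to be a positive integer
        self.features = []              # flat list of feature vectors
        self.rng = random.Random(seed)

    def key_set(self, size):
        """Sample a small key set instead of concatenating the whole memory."""
        size = min(size, len(self.features))
        return self.rng.sample(self.features, size)

    def update(self, new_features, num_write):
        """Write a subset of the new features, evicting single entries when full."""
        num_write = min(num_write, len(new_features))
        for feat in self.rng.sample(new_features, num_write):
            if len(self.features) < self.capacity:
                self.features.append(feat)
            else:
                # Feature-wise eviction: overwrite one stored vector, not a frame.
                self.features[self.rng.randrange(self.capacity)] = feat


def enhance(query, keys):
    """Residual scaled dot-product attention of one query over the key set."""
    if not keys:
        return list(query)              # empty memory: feature is left unchanged
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                   # subtract the max for a stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    attended = [
        sum(w * key[d] for w, key in zip(weights, keys)) / total
        for d in range(len(query))
    ]
    return [q + a for q, a in zip(query, attended)]


def run_video(frames, capacity=8, key_size=4, num_write=2):
    """Enhance every feature of every frame, then update the bank feature-wise."""
    bank = MemoryBank(capacity)
    outputs = []
    for frame in frames:
        keys = bank.key_set(key_size)   # one key set per frame
        outputs.append([enhance(feat, keys) for feat in frame])
        bank.update(frame, num_write)
    return bank, outputs
```

In this sketch the attention cost per query grows with `key_size` rather than with the number of stored features, and because eviction acts on individual vectors, entries written by early frames can remain in the bank long after a frame-wise queue would have dropped them.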