Identity-Consistent Aggregation for Video Object Detection

In Video Object Detection (VID), a common practice is to leverage the rich temporal contexts from the video to enhance the object representations in each frame. Existing methods treat the temporal contexts obtained from different objects indiscriminately and ignore their different identities. While intuitively, aggregating local views of the same object in different frames may facilitate a better understanding of the object. Thus, in this paper, we aim to enable the model to focus on the identity-consistent temporal contexts of each object to obtain more comprehensive object representations and handle the rapid object appearance variations such as occlusion, motion blur, etc. However, realizing this goal on top of existing VID models faces low-efficiency problems due to their redundant region proposals and nonparallel frame-wise prediction manner. To aid this, we propose ClipVID, a VID model equipped with Identity-Consistent Aggregation (ICA) layers specifically designed for mining fine-grained and identity-consistent temporal contexts. It effectively reduces the redundancies through the set prediction strategy, making the ICA layers very efficient and further allowing us to design an architecture that makes parallel clip-wise predictions for the whole video clip. Extensive experimental results demonstrate the superiority of our method: a state-of-the-art (SOTA) performance (84.7% mAP) on the ImageNet VID dataset while running at a speed about 7x faster (39.3 fps) than previous SOTAs.

翻译：在视频目标检测（VID）中，常见做法是利用视频中丰富的时序上下文增强每帧的目标表征。现有方法无差别地处理从不同目标获取的时序上下文，忽略其不同身份。而直观上，聚合同一目标在不同帧中的局部视图可能有助于更全面地理解该目标。因此，本文旨在使模型聚焦于每个目标的身份一致性时序上下文，从而获取更全面的目标表征，并应对快速的目标外观变化（如遮挡、运动模糊等）。然而，在现有VID模型基础上实现这一目标存在效率低下的问题，原因在于其冗余的区域提议和非并行的逐帧预测方式。为解决此问题，我们提出ClipVID——一种配备身份一致性聚合（ICA）层的VID模型，该层专门设计用于挖掘细粒度的身份一致性时序上下文。通过集合预测策略有效降低冗余，使ICA层具有极高效率，进而允许我们设计一种对整段视频片段进行并行剪辑级预测的架构。大量实验结果表明了本文方法的优越性：在ImageNet VID数据集上达到最新最优（SOTA）性能（84.7% mAP），同时运行速度约为先前SOTA方法的7倍（39.3 fps）。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一种无需使用负样本的自监督学习方法，Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes

专知会员服务

15+阅读 · 2022年3月12日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日