The primary challenge in Video Object Detection (VOD) is effectively exploiting temporal information to enhance object representations. Traditional strategies, such as aggregating region proposals, often suffer from feature variance caused by the inclusion of background information. We introduce FAIM, a new VOD method that enhances temporal Feature Aggregation by leveraging Instance Mask features, refining the aggregation process and better capturing object dynamics across video frames. Specifically, we propose the lightweight Instance Feature Extraction Module (IFEM) to learn instance mask features and the Temporal Instance Classification Aggregation Module (TICAM) to aggregate instance mask and classification features across video frames. Using YOLOX as a base detector, FAIM achieves 87.9% mAP on the ImageNet VID dataset at 33 FPS on a single 2080Ti GPU, setting a new benchmark for the speed-accuracy trade-off. Additional experiments on multiple datasets validate that our approach is robust, method-agnostic, and effective for multi-object tracking, demonstrating its broader applicability to video understanding tasks.
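The abstract names TICAM as the module that aggregates instance mask and classification features across frames but does not detail its mechanism. The sketch below illustrates one common pattern such aggregation could follow, a similarity-weighted blend of reference-frame instance features into key-frame features; the function name, tensor shapes, and cosine-similarity weighting are illustrative assumptions, not the paper's actual design.

```python
# Illustrative sketch only: the abstract does not specify TICAM's internals,
# so this uses generic similarity-weighted aggregation as a stand-in.
import torch
import torch.nn.functional as F

def aggregate_instance_features(key_feats: torch.Tensor,
                                ref_feats: torch.Tensor) -> torch.Tensor:
    """Aggregate per-instance features across frames (hypothetical helper).

    key_feats: (N, D) instance features from the key frame.
    ref_feats: (M, D) instance features pooled from reference frames.
    Returns:   (N, D) temporally enhanced key-frame features.
    """
    # Cosine similarity between key-frame and reference-frame instances.
    sim = F.normalize(key_feats, dim=1) @ F.normalize(ref_feats, dim=1).T  # (N, M)
    weights = sim.softmax(dim=1)                                           # (N, M)
    # Each key instance becomes a similarity-weighted blend of reference
    # instances, fused with its original feature via a residual connection.
    return key_feats + weights @ ref_feats
```

Under this reading, mask-pooled features would enter such an aggregation step in place of the box-pooled proposal features used by earlier VOD methods, which is what would reduce the background-induced variance the abstract describes.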