Current video object detection (VOD) models often suffer from over-aggregation because they rely on redundant aggregation strategies that perform feature aggregation on every frame. This results in suboptimal performance and increased computational complexity. In this work, we propose an image-level Object Detection Difficulty (ODD) metric that quantifies how difficult it is to detect the objects in a given image. The resulting ODD scores can be used in the VOD process to mitigate over-aggregation. Specifically, we train an ODD predictor as an auxiliary head of a still-image object detector; it computes an ODD score for each image based on the discrepancies between detection results and ground-truth bounding boxes. The ODD score enhances a VOD system in two ways: 1) it enables the system to select better global reference frames, thereby improving overall accuracy; and 2) it serves as the indicator in a newly designed ODD Scheduler that skips aggregation for frames that are easy to detect, thereby accelerating the VOD process. Comprehensive experiments demonstrate that, when used to select global reference frames, ODD-VOD consistently improves the accuracy of global-frame-based VOD models. When used for acceleration, ODD-VOD increases frames per second (FPS) by an average of 73.3% across 8 different VOD models without sacrificing accuracy. When the two uses are combined, ODD-VOD attains state-of-the-art accuracy and speed in comparison with many VOD methods. Our work represents a significant step towards making VOD more practical for real-world applications.
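The two uses of the ODD score described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the fixed threshold, the example scores, and the choice to treat the lowest-scoring (easiest) frames as the preferred global reference frames are all assumptions made for illustration; the abstract does not specify the selection rule or how the scheduler sets its cut-off.

```python
from typing import List, Sequence


def select_reference_frames(odd_scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k frames with the lowest ODD score.

    Assumption: easier frames (lower ODD) make better global references.
    The abstract only says ODD helps pick "superior" reference frames.
    """
    order = sorted(range(len(odd_scores)), key=lambda i: odd_scores[i])
    return sorted(order[:k])


def schedule_aggregation(odd_scores: Sequence[float], threshold: float) -> List[bool]:
    """Return one flag per frame: True means run feature aggregation.

    Assumption: a fixed, hypothetical threshold separates easy frames
    (handled by the still-image detector alone) from hard frames.
    """
    return [score >= threshold for score in odd_scores]


# Hypothetical per-frame ODD scores for a five-frame clip.
scores = [0.10, 0.80, 0.35, 0.05, 0.60]

refs = select_reference_frames(scores, k=2)
plan = schedule_aggregation(scores, threshold=0.5)
skipped = plan.count(False)

print(refs)     # indices of the chosen reference frames
print(plan)     # aggregation decision per frame
print(skipped)  # number of frames that skip aggregation
```

In this toy clip, three of the five frames fall below the threshold and would bypass aggregation, which is the mechanism the abstract credits for the speed-up. In practice the threshold would trade accuracy against FPS and would need tuning per model.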