With the increasing importance of video data in real-world applications, there is a rising need for efficient object detection methods that utilize temporal information. While existing video object detection (VOD) techniques employ various strategies to address this challenge, they typically depend on locally adjacent frames or randomly sampled images within a clip. Although recent Transformer-based VOD methods have shown promising results, their reliance on multiple inputs and additional network complexity to incorporate temporal information limits their practical applicability. In this paper, we propose a novel approach to single-image object detection, called Context Enhanced TRansformer (CETR), which incorporates temporal context into DETR using a newly designed memory module. To store temporal information efficiently, we construct a class-wise memory that collects contextual information across data. Additionally, we present a classification-based sampling technique to selectively utilize the relevant memory for the current image. At test time, we introduce a memory adaptation method that updates individual memory functions by considering the test distribution. Experiments on the CityCam and ImageNet VID datasets demonstrate the efficiency of the framework on various video systems. The project page and code will be made available at: https://ku-cvlab.github.io/CETR.
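The interplay of a class-wise memory, classification-based sampling, and test-time memory updates can be illustrated with a minimal Python sketch. This is not the CETR implementation: the abstract does not specify the memory layout or the update rule, so the single prototype vector per class, the exponential-moving-average update, the top-k selection, and every name below (`ClassWiseMemory`, `update`, `sample`, `momentum`, `top_k`) are assumptions made purely for illustration.

```python
from typing import Dict, List, Sequence


class ClassWiseMemory:
    """Toy class-wise memory. Assumption: one prototype vector per class."""

    def __init__(self, num_classes: int, dim: int, momentum: float = 0.9) -> None:
        self.momentum = momentum
        # One zero-initialised slot per class.
        self.slots: List[List[float]] = [[0.0] * dim for _ in range(num_classes)]

    def update(self, class_id: int, feature: Sequence[float]) -> None:
        """Blend a new feature into the slot of `class_id` with a moving average.

        Assumption: the same rule could be reused at test time, with predicted
        classes, to adapt the memory to the test distribution.
        """
        m = self.momentum
        old = self.slots[class_id]
        self.slots[class_id] = [m * o + (1.0 - m) * f for o, f in zip(old, feature)]

    def sample(self, class_scores: Sequence[float], top_k: int) -> Dict[int, List[float]]:
        """Return the slots of the `top_k` classes with the highest scores."""
        ranked = sorted(
            range(len(class_scores)), key=lambda c: class_scores[c], reverse=True
        )
        return {c: list(self.slots[c]) for c in ranked[:top_k]}


# Usage: three classes, two-dimensional features (toy values).
memory = ClassWiseMemory(num_classes=3, dim=2, momentum=0.5)
memory.update(0, [1.0, 1.0])
memory.update(2, [4.0, 0.0])

# Image-level classification scores decide which class memories are read.
selected = memory.sample([0.7, 0.1, 0.2], top_k=2)
print(selected)
```

In this sketch only the slots of the highest-scoring classes are passed on, so the detector would attend to a small, relevant subset of the memory rather than to all of it. How the selected memory is actually fused into DETR is not described in the abstract and is left out here.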