With the increasing importance of video data in real-world applications, there is a rising need for efficient object detection methods that utilize temporal information. While existing video object detection (VOD) techniques employ various strategies to address this challenge, they typically depend on locally adjacent frames or randomly sampled images within a clip. Although recent Transformer-based VOD methods have shown promising results, their reliance on multiple inputs and additional network complexity to incorporate temporal information limits their practical applicability. In this paper, we propose a novel approach to single-image object detection, called Context Enhanced TRansformer (CETR), which incorporates temporal context into DETR using a newly designed memory module. To store temporal information efficiently, we construct a class-wise memory that collects contextual information across data. Additionally, we present a classification-based sampling technique that selectively utilizes the memory relevant to the current image. At test time, we introduce a memory adaptation method that updates individual memory functions by considering the test distribution. Experiments on the CityCam and ImageNet VID datasets demonstrate the efficiency of the framework on various video systems. The project page and code will be made available at: https://ku-cvlab.github.io/CETR.
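To make the two memory components concrete, the following is a minimal sketch of a class-wise memory with classification-based sampling. It is an illustration under stated assumptions, not the CETR implementation: the names `ClassWiseMemory`, `update`, and `sample` are hypothetical, and the exponential-moving-average update and top-k selection rule are assumed here for simplicity, since the abstract does not specify them.

```python
from typing import Dict, List, Sequence


class ClassWiseMemory:
    """Toy class-wise memory: one feature vector (slot) per object class.

    Hypothetical illustration only; the update and sampling rules are assumptions.
    """

    def __init__(self, num_classes: int, dim: int, momentum: float = 0.9) -> None:
        self.momentum = momentum
        # One zero-initialized slot per class.
        self.slots: List[List[float]] = [[0.0] * dim for _ in range(num_classes)]

    def update(self, class_id: int, feature: Sequence[float]) -> None:
        """Blend a new feature into the slot of its class (assumed EMA rule)."""
        m = self.momentum
        old = self.slots[class_id]
        self.slots[class_id] = [m * o + (1.0 - m) * f for o, f in zip(old, feature)]

    def sample(self, class_scores: Sequence[float], top_k: int) -> Dict[int, List[float]]:
        """Return the slots of the top_k classes ranked by classification score."""
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        return {c: self.slots[c] for c in ranked[:top_k]}


if __name__ == "__main__":
    memory = ClassWiseMemory(num_classes=3, dim=2, momentum=0.5)
    memory.update(1, [2.0, 4.0])              # store context for class 1
    selected = memory.sample([0.1, 0.7, 0.2], top_k=2)
    print(selected)                           # slots for the two highest-scoring classes
```

In this sketch, only the slots of the classes that the classifier scores highest for the current image are passed on, which mirrors the idea of using only the relevant memory. Calling `update` on predicted classes during inference would be one simple way to approximate test-time memory adaptation, though the abstract does not state that the method works this way.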