Video Object Detection (VOD) has seen many recent breakthroughs, but its performance remains limited by the imaging constraints of RGB sensors under adverse illumination conditions. To alleviate this issue, this work introduces a new computer vision task, RGB-thermal (RGBT) VOD, which adds a thermal modality that is insensitive to adverse illumination conditions. To promote research and development on RGBT VOD, we design a novel Erasure-based Interaction Network (EINet) and establish a comprehensive benchmark dataset (VT-VOD50) for this task. Traditional VOD methods often exploit temporal information through many auxiliary frames and thus incur a large computational burden. Considering that thermal images exhibit less noise than RGB images, we develop a negative activation function that erases noise in RGB features with the help of thermal image features. Furthermore, benefiting from the thermal images, we rely on only a small temporal window to model spatio-temporal information, which greatly improves efficiency while maintaining detection accuracy. The VT-VOD50 dataset consists of 50 pairs of challenging RGBT video sequences collected in real traffic scenarios, with complex backgrounds, various objects, and different illumination conditions. Extensive experiments on the VT-VOD50 dataset demonstrate the effectiveness and efficiency of the proposed method against existing mainstream VOD methods. The code of EINet and the dataset will be released publicly for free academic use.
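The erasure idea described above, in which thermal features guide the suppression of noisy RGB features through a negative activation, can be illustrated with a minimal sketch. This is not the EINet formulation, which the abstract does not specify. The function names `negative_activation` and `erase_rgb_noise`, the choice of a negated sigmoid, and the elementwise residual form are all assumptions made purely for illustration.

```python
import numpy as np


def negative_activation(x):
    """Hypothetical 'negative' activation (an assumption, not from the paper).

    Computes -sigmoid(-x), so every output lies in (-1, 0):
    strongly negative inputs map close to -1, strongly positive inputs close to 0.
    """
    return -1.0 / (1.0 + np.exp(x))


def erase_rgb_noise(rgb_feat, thermal_feat):
    """Illustrative thermal-guided erasure of RGB features.

    The thermal features produce erasure weights in (-1, 0). Adding the
    weighted RGB features back to the originals removes a fraction of each
    RGB response: almost everything where the thermal response is strongly
    negative, almost nothing where it is strongly positive.
    """
    weights = negative_activation(thermal_feat)  # elementwise, in (-1, 0)
    return rgb_feat + weights * rgb_feat         # same as rgb_feat * sigmoid(thermal_feat)


# Toy feature maps with shape (channels, positions); values are made up.
rgb = np.ones((2, 3))
thermal = np.array([[10.0, -10.0, 0.0],
                    [10.0, -10.0, 0.0]])
erased = erase_rgb_noise(rgb, thermal)
```

In this toy setting, RGB responses are kept almost unchanged where the thermal response is strongly positive and are driven toward zero where it is strongly negative. A real network would presumably learn the guidance signal rather than use raw thermal features directly.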