There is currently strong interest in improving visual object tracking by augmenting the RGB modality with the output of a visual event camera, which is particularly informative about scene motion. However, existing RGB-E tracking approaches extract event features using traditional appearance models that have been optimised for RGB-only tracking, without adapting them to the intrinsic characteristics of event data. To address this problem, we propose an Event backbone (Pooler), designed to obtain a high-quality feature representation that is cognisant of the innate characteristics of event data, namely its sparsity. In particular, Multi-Scale Pooling is introduced to capture all the motion feature trends within event data by using diverse pooling kernel sizes. The association between the derived RGB and event representations is established by an innovative module performing adaptive Mutually Guided Fusion (MGF). Extensive experimental results show that our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets, VisEvent and COESOT; on COESOT, the precision and success rates are improved by 4.9% and 5.2%, respectively. Our code will be available at https://github.com/SSSpc333/TENet.
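To make the idea of pooling a sparse event frame with several kernel sizes concrete, the following is a minimal illustrative sketch and not the Pooler backbone itself. The choice of average pooling, the kernel sizes (3, 5, 7), the zero padding, the averaging of the per-scale outputs, and the function names `avg_pool2d_same` and `multi_scale_pool` are all assumptions made for illustration; none of them are specified in the abstract.

```python
def avg_pool2d_same(grid, k):
    """Average-pool a 2D list with a k x k window (k odd), stride 1, zero padding."""
    h, w = len(grid), len(grid[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    # Positions outside the frame count as zeros (zero padding).
                    if 0 <= y < h and 0 <= x < w:
                        total += grid[y][x]
            out[i][j] = total / (k * k)
    return out


def multi_scale_pool(grid, kernel_sizes=(3, 5, 7)):
    """Average the pooled maps over several kernel sizes (assumed fusion rule)."""
    h, w = len(grid), len(grid[0])
    fused = [[0.0] * w for _ in range(h)]
    for k in kernel_sizes:
        pooled = avg_pool2d_same(grid, k)
        for i in range(h):
            for j in range(w):
                fused[i][j] += pooled[i][j] / len(kernel_sizes)
    return fused


# A sparse 7x7 "event frame" with a single active pixel at the centre.
frame = [[0.0] * 7 for _ in range(7)]
frame[3][3] = 1.0
fused = multi_scale_pool(frame)
```

In this toy setting, small kernels keep the response concentrated around the active pixel, while larger kernels spread it over a wider neighbourhood, so the fused map carries information at several spatial scales even though the input is almost entirely zeros.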