CRSOT: Cross-Resolution Object Tracking using Unaligned Frame and Event Cameras

Existing datasets for RGB-DVS tracking are collected with the DVS346 camera, whose resolution ($346 \times 260$) is too low for practical applications. Moreover, many practical systems deploy only visible cameras, and newly designed neuromorphic cameras may have different resolutions. The latest neuromorphic sensors can output high-definition event streams, but it is very difficult to achieve strict spatial and temporal alignment between events and frames. How to track accurately with unaligned neuromorphic and visible sensors is therefore a valuable but unexplored problem. In this work, we formally propose the task of object tracking using unaligned neuromorphic and visible cameras. We build the first unaligned frame-event dataset, CRSOT, collected with a specially built data acquisition system; it contains 1,030 high-definition RGB-Event video pairs totaling 304,974 video frames. In addition, we propose a novel unaligned object tracking framework that achieves robust tracking even with loosely aligned RGB-Event data. Specifically, we extract the template and search regions of the RGB and Event data and feed them into a unified ViT backbone for feature embedding. We then propose uncertainty perception modules to encode the RGB and Event features separately, and a modality uncertainty fusion module to aggregate the two modalities. These three branches are jointly optimized in the training phase. Extensive experiments demonstrate that our tracker can combine the two modalities for high-performance tracking even without strict temporal and spatial alignment. The source code, dataset, and pre-trained models will be released at https://github.com/Event-AHU/Cross_Resolution_SOT.
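
The abstract does not specify how the modality uncertainty fusion module combines the two feature streams. As a rough illustration of the general idea only, the sketch below uses inverse-variance weighting, a common uncertainty-aware fusion rule, and it is not claimed to be the authors' module. The function name `fuse_by_uncertainty` and the assumption that each modality supplies a per-channel feature together with a predicted log-variance are hypothetical.

```python
import math
from typing import List, Sequence


def fuse_by_uncertainty(
    rgb_feat: Sequence[float],
    rgb_log_var: Sequence[float],
    event_feat: Sequence[float],
    event_log_var: Sequence[float],
) -> List[float]:
    """Inverse-variance weighted fusion of two per-channel feature vectors.

    Hypothetical stand-in for an uncertainty-based fusion step (an assumption,
    not the paper's module): each modality provides a feature and a predicted
    log-variance, and the modality with lower variance receives more weight.
    """
    lengths = {len(rgb_feat), len(rgb_log_var), len(event_feat), len(event_log_var)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")

    fused = []
    for r, r_lv, e, e_lv in zip(rgb_feat, rgb_log_var, event_feat, event_log_var):
        w_rgb = math.exp(-r_lv)  # precision = 1 / variance
        w_evt = math.exp(-e_lv)
        fused.append((w_rgb * r + w_evt * e) / (w_rgb + w_evt))
    return fused


# Equal uncertainty gives a plain average of the two modalities.
balanced = fuse_by_uncertainty([1.0], [0.0], [3.0], [0.0])

# A very uncertain Event channel leaves the result close to the RGB feature.
rgb_dominant = fuse_by_uncertainty([1.0], [0.0], [3.0], [50.0])
```

Under this rule, a modality that is degraded for a given channel (for example, because of misalignment) is down-weighted rather than discarded, which is one plausible reading of fusing loosely aligned RGB-Event features.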
