Event cameras, also known as dynamic vision sensors, have recently been applied successfully to tasks ranging from fundamental vision problems to high-level vision research. Because they capture light intensity changes asynchronously, event cameras have an inherent advantage in capturing moving objects in challenging scenarios, including low light, high dynamic range, and fast motion. Event cameras are therefore a natural fit for visual object tracking. However, current event-based trackers derived from RGB trackers simply replace the input images with event frames and still follow the conventional tracking pipeline, which mainly relies on object texture to distinguish the target. As a result, these trackers may not be robust in challenging scenarios such as moving cameras and cluttered foregrounds. In this paper, we propose a distractor-aware event-based tracker, named DANet, that introduces transformer modules into a Siamese network architecture. Specifically, our model is mainly composed of a motion-aware network and a target-aware network, which simultaneously exploit both motion cues and object contours from event data to discover moving objects and identify the target object by removing dynamic distractors. DANet can be trained end to end without any post-processing and runs at over 80 FPS on a single V100. We conduct comprehensive experiments on two large event tracking datasets to validate the proposed model and demonstrate that our tracker outperforms state-of-the-art trackers in terms of both accuracy and efficiency.