Visual Object Tracking (VOT) is an attractive and important research area in computer vision. It aims to recognize and track specific targets in video sequences, where the target objects are arbitrary and class-agnostic. VOT can be applied in a variety of scenarios and can process data of diverse modalities, such as RGB, thermal infrared, and point cloud. In addition, since no single sensor can handle all dynamic and varying environments, multi-modal VOT is also under investigation. This paper presents a comprehensive survey of recent progress in both single-modal and multi-modal VOT, with an emphasis on deep learning methods. Specifically, we first review three mainstream types of single-modal VOT: RGB, thermal infrared, and point cloud tracking. In particular, we summarize four widely used single-modal frameworks, abstracting their schemas and categorizing the existing methods that follow them. We then summarize four kinds of multi-modal VOT: RGB-Depth, RGB-Thermal, RGB-LiDAR, and RGB-Language. Moreover, we present comparison results on numerous VOT benchmarks for the modalities discussed. Finally, we provide recommendations and observations intended to inform the future development of this fast-growing field.