Awesome Multi-modal Object Tracking

Multi-modal object tracking (MMOT) is an emerging field that combines data from various modalities, e.g., vision (RGB), depth, thermal infrared, event, language, and audio, to estimate the state of an arbitrary object in a video sequence. It is important for applications such as autonomous driving and intelligent surveillance. In recent years, MMOT has received increasing attention. However, existing MMOT algorithms mainly focus on two modalities (e.g., RGB+depth, RGB+thermal infrared, or RGB+language). To leverage more modalities, some recent efforts have aimed to learn a unified visual object tracking model for any modality. Additionally, some large-scale multi-modal tracking benchmarks that simultaneously provide more than two modalities have been established, such as vision-language-audio (e.g., WebUAV-3M) and vision-depth-language (e.g., UniMod1K). To track the latest progress in MMOT, we conduct a comprehensive investigation in this report. Specifically, we first divide existing MMOT tasks into five main categories, i.e., RGBL tracking, RGBE tracking, RGBD tracking, RGBT tracking, and miscellaneous (RGB+X), where X can be any modality, such as language, depth, and event. Then, we analyze and summarize each MMOT task, focusing on widely used datasets and on mainstream tracking algorithms grouped by technical paradigm (e.g., self-supervised learning, prompt learning, knowledge distillation, generative models, and state space models). Finally, we maintain a continuously updated paper list for MMOT at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.
