Generative models (GMs) have attracted growing research interest for their remarkable capacity for comprehensive understanding, yet their application to multi-modal tracking remains relatively unexplored. We investigate whether generative techniques can address the critical challenge in multi-modal tracking: information fusion. Specifically, we study two prominent GM techniques, Conditional Generative Adversarial Networks (CGANs) and Diffusion Models (DMs). Unlike the standard fusion process, in which the features from each modality are fed directly into the fusion block, we condition these multi-modal features with random noise in the GM framework, effectively transforming the original training samples into harder instances. This design helps extract discriminative clues from the features, improving the final tracking performance. To quantify the effectiveness of our approach, we conduct extensive experiments across two multi-modal tracking tasks, three baseline methods, and three challenging benchmarks. The results show that the proposed generative fusion mechanism achieves state-of-the-art performance, setting new records on LasHeR and RGBD1K.
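To make the contrast with standard fusion concrete, the following is a minimal sketch of one possible way to perturb multi-modal features with random noise before they reach a fusion block. It is an illustration under stated assumptions, not the method from the paper: the abstract does not specify how the noise is injected, so the diffusion-style mixing rule, the concatenation of features, and all names (`noise_features`, `build_fusion_input`, `alpha_bar`) are hypothetical.

```python
import math
import random


def noise_features(features, alpha_bar, rng):
    """Diffusion-style perturbation (assumed form, not taken from the paper):
    x_noisy = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps, eps ~ N(0, 1)."""
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [keep * x + spread * rng.gauss(0.0, 1.0) for x in features]


def build_fusion_input(rgb_feat, aux_feat, alpha_bar, rng):
    """Concatenate the two modality features, then perturb them with noise.

    alpha_bar = 1.0 reproduces the standard (noise-free) fusion input;
    smaller values yield noisier, harder training instances.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    clean = list(rgb_feat) + list(aux_feat)
    return noise_features(clean, alpha_bar, rng)


# Hypothetical toy features standing in for per-modality embeddings.
rng = random.Random(0)
rgb_feat = [0.2, -0.5, 1.0]   # e.g. RGB branch
aux_feat = [0.7, 0.1, -0.3]   # e.g. thermal or depth branch

standard_input = rgb_feat + aux_feat
harder_input = build_fusion_input(rgb_feat, aux_feat, alpha_bar=0.8, rng=rng)
```

In this sketch the fusion block itself is left out; the point is only that the block would receive `harder_input` during training instead of `standard_input`, which is the sense in which the original samples become harder instances.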