This paper addresses cross-modal object tracking from RGB videos and event data. Rather than constructing a complex cross-modal fusion network, we explore the potential of a pre-trained vision Transformer (ViT). In particular, we investigate plug-and-play training augmentations that encourage the ViT to bridge the large distribution gap between the two modalities, enabling comprehensive cross-modal information interaction and thus improving its tracking ability. Specifically, we propose a mask modeling strategy that randomly masks a specific modality of some tokens, forcing tokens from different modalities to interact proactively. To mitigate the network oscillations caused by the masking strategy and to further amplify its positive effect, we then propose a theoretically motivated orthogonal high-rank loss to regularize the attention matrix. Extensive experiments demonstrate that our plug-and-play training augmentation techniques significantly improve state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate. Our new perspective and findings may offer insights into leveraging powerful pre-trained ViTs to model cross-modal data. The code will be publicly available.
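To make the two augmentations above more concrete, the following is a minimal NumPy sketch and not the released implementation. It assumes that RGB and event tokens are spatially aligned arrays of shape (num_tokens, dim), and that masking means zeroing a token. The names `mask_modalities`, `mask_ratio`, and `orthogonality_penalty` are hypothetical. The penalty shown is a generic stand-in that pushes a square attention matrix toward orthonormal rows (and hence full rank); the exact form of the orthogonal high-rank loss is not given in this abstract.

```python
import numpy as np


def mask_modalities(rgb_tokens, event_tokens, mask_ratio=0.1, rng=None):
    """Randomly zero out one modality at a subset of token positions.

    Assumption: both inputs have shape (num_tokens, dim) and are aligned,
    so position i in each array describes the same spatial patch.
    Returns the masked copies and a per-token code:
    0 = both kept, 1 = RGB masked, 2 = event masked.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_tokens = rgb_tokens.shape[0]

    # Select the positions that will lose one of their two modalities.
    selected = rng.random(num_tokens) < mask_ratio

    # For each selected position, choose which modality to drop (1 or 2).
    choice = np.zeros(num_tokens, dtype=np.int64)
    choice[selected] = rng.integers(1, 3, size=int(selected.sum()))

    # Zero the chosen modality; the other modality at that position is kept,
    # so the missing information has to come from the other modality.
    rgb_out = np.where((choice == 1)[:, None], 0.0, rgb_tokens)
    event_out = np.where((choice == 2)[:, None], 0.0, event_tokens)
    return rgb_out, event_out, choice


def orthogonality_penalty(attn):
    """Illustrative regulariser (an assumption, not the paper's loss).

    Squared Frobenius distance between attn @ attn.T and the identity.
    It is zero exactly when the rows of `attn` are orthonormal.
    """
    n = attn.shape[0]
    gram = attn @ attn.T
    return float(np.sum((gram - np.eye(n)) ** 2))
```

In this sketch no position ever loses both modalities, which matches the idea that a masked token must draw on the other modality. During training, the penalty would be added to the tracking loss with a small weight; that weight is likewise an assumption.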