Visual object tracking plays a critical role in vision-based autonomous systems, as it aims to estimate the position and size of an object of interest within a live video. Despite significant progress in this field, state-of-the-art (SOTA) trackers often fail when faced with adversarial perturbations in the incoming frames, which raises serious robustness and security concerns when these trackers are deployed in the real world. To achieve high accuracy on both clean and adversarial data, we propose building a spatial-temporal continuous representation guided by the semantic text description of the object of interest. This novel continuous representation enables us to reconstruct incoming frames so that they remain semantically and visually consistent with the object of interest and with their clean counterparts. As a result, our proposed method successfully defends against diverse SOTA adversarial tracking attacks while maintaining high accuracy on clean data. In particular, our method increases tracking accuracy under adversarial attacks by around 90% (relative) on UAV123, yielding accuracy even higher than that on clean data.
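The defense described above can be viewed as a preprocessing step: each incoming frame is reconstructed before it reaches the tracker, so that adversarial perturbations are removed while the appearance of the target is preserved. The following is a minimal NumPy sketch of that pipeline shape only; the learned, text-guided spatial-temporal continuous representation of the paper is replaced here by a simple local-averaging reconstruction, and all function names (`reconstruct_frame`, `defended_track`, the `smoothing` knob) are illustrative assumptions, not the authors' API.

```python
import numpy as np

def reconstruct_frame(frame, smoothing=1):
    """Stand-in for the learned continuous-representation reconstruction.

    Here we just apply iterative 5-point local averaging, which suppresses
    high-frequency (adversarial-style) perturbations while keeping the
    smooth image content. `smoothing` is a hypothetical knob for this
    sketch, not a parameter from the paper.
    """
    out = frame.astype(np.float64)
    for _ in range(smoothing):
        padded = np.pad(out, ((1, 1), (1, 1)), mode="edge")
        out = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
               padded[1:-1, :-2] + padded[1:-1, 2:] + out) / 5.0
    return out

def defended_track(tracker_step, frames):
    """Run an arbitrary per-frame tracker on reconstructed frames
    instead of the raw (possibly attacked) ones."""
    return [tracker_step(reconstruct_frame(f, smoothing=3)) for f in frames]

# Demo: a smooth "clean" frame plus high-frequency sign noise,
# mimicking an L-infinity-bounded adversarial perturbation.
rng = np.random.default_rng(0)
clean = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64)
noise = 0.2 * rng.choice([-1.0, 1.0], size=clean.shape)
adv = clean + noise

rec = reconstruct_frame(adv, smoothing=3)
err_before = np.abs(adv - clean).mean()
err_after = np.abs(rec - clean).mean()
print(err_before, err_after)  # reconstruction moves the frame back toward the clean one
```

The point of the sketch is the interface, not the filter: any tracker can be wrapped by `defended_track` without retraining, which mirrors how a reconstruction-based defense is applied in front of an off-the-shelf SOTA tracker.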