See No Evil: Adversarial Attacks Against Linguistic-Visual Association in Referring Multi-Object Tracking Systems

Language-vision understanding has driven the development of advanced perception systems, most notably the emerging paradigm of Referring Multi-Object Tracking (RMOT). By leveraging natural-language queries, RMOT systems selectively track objects that satisfy a given semantic description, guided by Transformer-based spatial-temporal reasoning modules. End-to-End (E2E) RMOT models further unify feature extraction, temporal memory, and spatial reasoning within a Transformer backbone, enabling long-range spatial-temporal modeling over fused textual-visual representations. Despite these advances, the reliability and robustness of RMOT remain underexplored. In this paper, we examine the security of RMOT systems from a design-logic perspective, identifying adversarial vulnerabilities that compromise both the linguistic-visual referring and track-object matching components. Additionally, we uncover a novel vulnerability in advanced RMOT models that employ FIFO-based memory: targeted and consistent attacks on their spatial-temporal reasoning introduce errors that persist within the history buffer over multiple subsequent frames. We present VEIL, a novel adversarial framework designed to disrupt the unified referring-matching mechanisms of RMOT models. We show that carefully crafted digital and physical perturbations can undermine the reliability of the tracking logic, inducing track ID switches and terminations. We conduct comprehensive evaluations on the Refer-KITTI dataset to validate the effectiveness of VEIL and demonstrate the urgent need for security-aware RMOT designs in critical large-scale applications.
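To make the FIFO-memory persistence effect concrete, the following toy sketch may help. It is not VEIL's implementation and does not model any real RMOT architecture: the function name `frames_contaminated`, the buffer length, and the use of a boolean flag in place of a per-frame feature are all assumptions chosen for illustration. It only shows that, under a fixed-length first-in-first-out history, a single corrupted entry remains visible to the tracker until it is evicted.

```python
from collections import deque


def frames_contaminated(buffer_len, attacked_frames, total_frames):
    """Return the frame indices at which a FIFO history buffer still
    holds at least one corrupted entry.

    Hypothetical toy model: each frame pushes one entry into the buffer,
    and the entry is flagged True if that frame was attacked.
    """
    # deque with maxlen evicts the oldest entry automatically (FIFO).
    history = deque(maxlen=buffer_len)
    contaminated = []
    for t in range(total_frames):
        history.append(t in attacked_frames)  # True marks a corrupted entry
        if any(history):
            contaminated.append(t)
    return contaminated


if __name__ == "__main__":
    # One attacked frame (t=2) with a hypothetical buffer of length 4.
    print(frames_contaminated(buffer_len=4, attacked_frames={2}, total_frames=10))
```

In this simplified model, a single attacked frame stays in the history for as many frames as the buffer is long, which is one way to read the claim that errors persist over multiple subsequent frames.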
