Deep neural networks (DNNs) have achieved tremendous success in various applications, including video action recognition, yet remain vulnerable to backdoor attacks (Trojans). A backdoor-compromised model misclassifies a test instance from a non-target class to the attacker-chosen target class when the instance is embedded with a specific trigger, while maintaining high accuracy on attack-free instances. Although backdoor attacks against image data have been studied extensively, the susceptibility of video-based systems to backdoor attacks remains largely unexplored. Current studies are direct extensions of approaches proposed for image data, e.g., the triggers are \textbf{independently} embedded within the frames, and such triggers tend to be detectable by existing defenses. In this paper, we introduce a \textit{simple} yet \textit{effective} backdoor attack against video data. Our proposed attack adds perturbations in a transformed domain to plant an \textbf{imperceptible, temporally distributed} trigger across the video frames, and is shown to be resilient to existing defensive strategies. The effectiveness of the proposed attack is demonstrated by extensive experiments with various well-known models on two video recognition benchmarks, UCF101 and HMDB51, and a sign language recognition benchmark, the Greek Sign Language (GSL) dataset. We examine the impact of several influential factors on our proposed attack and, through extensive studies, identify an intriguing effect termed "collateral damage."