Deep neural networks (DNNs) have achieved tremendous success in various applications, including video action recognition, yet remain vulnerable to backdoor attacks (Trojans). A backdoor-compromised model misclassifies a test instance from a non-target class into the attacker-chosen target class when the instance is embedded with a specific trigger, while maintaining high accuracy on attack-free instances. Although backdoor attacks against image data have been studied extensively, the susceptibility of video-based systems to backdoor attacks remains largely unexplored. Existing studies directly extend approaches proposed for image data, e.g., by embedding triggers independently in each frame, and such triggers tend to be detectable by existing defenses. In this paper, we introduce a simple yet effective backdoor attack against video data. By adding perturbations in a transformed domain, the proposed attack plants an imperceptible, temporally distributed trigger across the video frames and is shown to be resilient to existing defensive strategies. Its effectiveness is demonstrated by extensive experiments with various well-known models on two video recognition benchmarks, UCF101 and HMDB51, and on a sign language recognition benchmark, the Greek Sign Language (GSL) dataset. We also study the impact of several influential factors on the proposed attack and identify an intriguing effect termed "collateral damage".
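To make the idea of a transformed-domain, temporally distributed trigger concrete, the following is a minimal illustrative sketch and not the attack evaluated in this paper. It assumes, purely for illustration, that the transform is a 3D discrete Fourier transform over (time, height, width) and that the trigger is a single perturbed frequency coefficient; the function name `embed_trigger` and the parameters `freq_index` and `amplitude` are hypothetical.

```python
import numpy as np


def embed_trigger(video, freq_index=(1, 2, 3), amplitude=2.0):
    """Add a small perturbation to one spatio-temporal frequency of a clip.

    ASSUMPTIONS (illustrative only, not taken from the paper):
      - `video` is a grayscale clip of shape (T, H, W) with values in [0, 255].
      - The "transformed domain" is the 3D real FFT of the clip.
      - `freq_index` = (k_t, k_h, k_w) with 0 < k_w < W / 2, so the perturbed
        coefficient corresponds to a real sinusoid after inversion.
    """
    video = np.asarray(video, dtype=np.float64)
    n_elements = video.size

    # Forward transform over time and both spatial axes.
    spectrum = np.fft.rfftn(video, axes=(0, 1, 2))

    # A real offset c on one coefficient yields, after inversion, a sinusoid
    # with per-pixel amplitude 2 * c / N. Choose c so the per-pixel change is
    # bounded by `amplitude`.
    spectrum[freq_index] += amplitude * n_elements / 2.0

    # Inverse transform; `s` restores the original shape exactly.
    poisoned = np.fft.irfftn(spectrum, s=video.shape, axes=(0, 1, 2))

    # Keep pixel values in the valid range.
    return np.clip(poisoned, 0.0, 255.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.uniform(50.0, 200.0, size=(8, 16, 16))  # synthetic stand-in clip
    poisoned_clip = embed_trigger(clip)
    delta = poisoned_clip - clip
    print("max per-pixel change:", np.abs(delta).max())
    print("frames modified:", int((np.abs(delta).max(axis=(1, 2)) > 0).sum()))
```

Because the perturbed coefficient has a nonzero temporal frequency, the resulting pattern is spread over every frame and shifts from frame to frame, in contrast to a patch stamped independently on each frame. The bound on the per-pixel change is what keeps the pattern small in the pixel domain under these assumptions.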