Efficient Decision-based Black-box Patch Attacks on Video Recognition

Although Deep Neural Networks (DNNs) have demonstrated excellent performance, they are vulnerable to adversarial patches, which introduce perceptible, localized perturbations to the input. Generating adversarial patches on images has received much attention, whereas adversarial patches on videos have not been well investigated. Moreover, decision-based attacks, in which attackers access only the predicted hard labels by querying the threat model, have not been well explored on video models either, even though they are practical in real-world video recognition scenarios. The absence of such studies leaves a large gap in the robustness assessment of video models. To bridge this gap, this work is the first to explore decision-based patch attacks on video models. Our analysis shows that the huge parameter space of videos and the minimal information returned by decision-based models both greatly increase the attack difficulty and the query burden. To achieve a query-efficient attack, we propose a spatial-temporal differential evolution (STDE) framework. First, STDE uses target videos as patch textures and adds patches only to keyframes, which are adaptively selected by temporal difference. Second, STDE takes minimizing the patch area as its optimization objective and adopts spatial-temporal mutation and crossover to search for the global optimum without becoming trapped in local optima. Experiments show that STDE achieves state-of-the-art performance in terms of threat, efficiency, and imperceptibility. Hence, STDE has the potential to be a powerful tool for evaluating the robustness of video recognition models.
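To make the two steps above concrete, the following is a minimal sketch rather than the STDE implementation, which the abstract does not specify. It assumes a video is a list of grayscale frames (nested lists), scores each frame by its mean absolute difference from the previous frame, pastes target-video pixels into a rectangle on the selected keyframes, and shrinks that rectangle with a generic DE/rand/1/bin loop standing in for the spatial-temporal mutation and crossover. All names (`select_keyframes`, `apply_patch`, `de_minimize_area`, the `is_adversarial` oracle) and all parameter values are hypothetical.

```python
import random

def temporal_difference(video):
    """Mean absolute pixel change between each frame and its predecessor."""
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(video, video[1:]):
        total = sum(abs(a - b) for rp, rc in zip(prev, cur) for a, b in zip(rp, rc))
        scores.append(total / (len(cur) * len(cur[0])))
    return scores

def select_keyframes(video, k):
    """Indices of the k frames with the largest temporal difference, in time order."""
    scores = temporal_difference(video)
    top = sorted(range(len(video)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

def apply_patch(video, target, keyframes, box):
    """Copy target-video pixels inside box=(top, left, bottom, right) onto keyframes only."""
    top, left, bottom, right = box
    out = [[row[:] for row in frame] for frame in video]  # leave the input untouched
    for t in keyframes:
        for y in range(top, bottom):
            out[t][y][left:right] = target[t][y][left:right]
    return out

def patch_area(box):
    top, left, bottom, right = box
    return max(0, bottom - top) * max(0, right - left)

def de_minimize_area(is_adversarial, height, width,
                     pop_size=12, steps=60, F=0.5, CR=0.7, seed=0):
    """Generic DE/rand/1/bin over patch rectangles (an assumption, not the paper's operators).

    is_adversarial(box) is a hard-label oracle: one model query per call.
    """
    rng = random.Random(seed)

    def fitness(box):
        # Non-adversarial boxes are infeasible; feasible ones are ranked by area.
        return patch_area(box) if is_adversarial(box) else float("inf")

    def clip(box):
        t, l, b, r = (int(round(v)) for v in box)
        t, b = sorted((min(max(t, 0), height), min(max(b, 0), height)))
        l, r = sorted((min(max(l, 0), width), min(max(r, 0), width)))
        return (t, l, b, r)

    # Seed with the full-frame box (assumed adversarial) plus random boxes.
    pop = [(0, 0, height, width)]
    while len(pop) < pop_size:
        pop.append(clip((rng.uniform(0, height), rng.uniform(0, width),
                         rng.uniform(0, height), rng.uniform(0, width))))
    fit = [fitness(b) for b in pop]

    for _ in range(steps):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            mutant = [pop[a][d] + F * (pop[b][d] - pop[c][d]) for d in range(4)]
            forced = rng.randrange(4)  # at least one coordinate comes from the mutant
            trial = clip([mutant[d] if (rng.random() < CR or d == forced) else pop[i][d]
                          for d in range(4)])
            f = fitness(trial)
            if f <= fit[i]:  # greedy selection: never accept a worse box
                pop[i], fit[i] = trial, f

    best = min(range(pop_size), key=lambda i: fit[i])
    return pop[best], fit[best]
```

In practice the oracle would wrap a model query, for example `lambda box: hard_label(apply_patch(video, target, keyframes, box)) == target_label`, where `hard_label` and `target_label` are placeholders for the attacked model and the target class. Because selection is greedy and the population starts from a full-frame patch, the best area found never increases, and each fitness evaluation costs exactly one query.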
