Previous work has shown that well-crafted adversarial perturbations can threaten the security of video recognition systems. Attackers can compromise such models with a low query budget when the perturbations are semantic-invariant, as in StyleFool. Despite its query efficiency, StyleFool applies style transfer to every pixel in each frame, so the naturalness of fine-detail regions still needs improvement. To close this gap, we propose LocalStyleFool, an improved black-box video adversarial attack that superimposes regional style-transfer-based perturbations on videos. Leveraging the popularity and scalable usability of the Segment Anything Model (SAM), we first extract regions according to semantic information and then track them through the video stream to maintain temporal consistency. Next, we add style-transfer-based perturbations to several regions, selected by a combined criterion of transfer-based gradient information and regional area. A perturbation fine-tuning step then makes the stylized videos adversarial. Through a human-assessed survey, we demonstrate that LocalStyleFool improves both intra-frame and inter-frame naturalness while maintaining a competitive fooling rate and query efficiency. Experiments on a high-resolution dataset further show that SAM's fine-grained segmentation helps adversarial attacks scale to high-resolution data.
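The region-selection step described in the abstract can be illustrated with a minimal sketch. The abstract only states that regions are chosen by a criterion combining transfer-based gradient information and regional area; the specific scoring formula, the weight `alpha`, the function name `select_regions`, and the input layout below are all assumptions made for illustration, not the authors' implementation. The segmentation masks are assumed to come from a segmenter such as SAM, and the gradient map from a surrogate model.

```python
import numpy as np

def select_regions(masks, grad_map, k=2, alpha=0.5):
    """Rank regions by a weighted mix of gradient mass and area.

    Hypothetical criterion: the weighting and normalization are
    illustrative assumptions, not taken from the paper.
    masks    : list of boolean arrays of shape (H, W), one per region
    grad_map : array of shape (H, W), gradient of a surrogate model's loss
    """
    grad_mag = np.abs(grad_map)
    total_grad = grad_mag.sum() + 1e-12  # avoid division by zero
    n_pixels = grad_mag.size
    scores = []
    for mask in masks:
        grad_share = grad_mag[mask].sum() / total_grad  # share of gradient mass
        area_share = mask.sum() / n_pixels              # share of frame area
        scores.append(alpha * grad_share + (1 - alpha) * area_share)
    order = np.argsort(scores)[::-1]  # highest score first
    return [int(i) for i in order[:k]], scores

# Toy 4x4 frame with three regions: left half, top-right, bottom-right.
left = np.zeros((4, 4), dtype=bool)
left[:, :2] = True
top_right = np.zeros((4, 4), dtype=bool)
top_right[:2, 2:] = True
bottom_right = np.zeros((4, 4), dtype=bool)
bottom_right[2:, 2:] = True

grad = np.zeros((4, 4))
grad[:2, 2:] = 1.0  # all gradient mass sits in the top-right region

selected, scores = select_regions([left, top_right, bottom_right], grad)
```

In this toy frame the top-right region ranks first because it holds all of the gradient mass, and the left half ranks second because of its larger area. In a full pipeline, only the selected regions would receive style-transfer perturbations, and the same masks would be tracked across frames.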