In video action recognition, shortcut static features can interfere with the learning of motion features, resulting in poor out-of-distribution (OOD) generalization. The video background is clearly a source of static bias, but the video foreground, such as the clothing of the actor, can also provide static bias. In this paper, we empirically verify the existence of foreground static bias by creating test videos in which the static and moving portions carry conflicting signals. To tackle this issue, we propose a simple yet effective technique, StillMix, to learn robust action representations. Specifically, StillMix identifies bias-inducing video frames using a 2D reference network and mixes them with training videos. This suppresses bias even when we cannot explicitly extract the source of bias within each video frame or enumerate the types of bias. Finally, to precisely evaluate static bias, we synthesize two new benchmarks: SCUBA for static cues in the background and SCUFO for static cues in the foreground. With extensive experiments, we demonstrate that StillMix mitigates both types of static bias and improves video representations for downstream applications.
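The two steps described above, selecting bias-inducing frames with a 2D reference network and mixing them into training videos, can be illustrated with a minimal sketch. The abstract does not specify the selection rule or the mixing formula, so everything below is an assumption made for illustration: frames are selected when a hypothetical `confidence` callable (standing in for the 2D reference network) exceeds a threshold, the mix is a pixel-wise convex combination, and the mixed sample keeps the label of the video. All function names, the threshold, and the range of the mixing weight are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Frame = List[float]   # one frame, flattened to a list of pixel values
Video = List[Frame]   # a clip is a list of frames


def select_biased_frames(
    frames: List[Frame],
    confidence: Callable[[Frame], float],
    threshold: float = 0.8,
) -> List[Frame]:
    """Keep frames that the 2D reference network classifies confidently.

    Assumption: high confidence on a single still frame indicates that the
    frame carries strong static cues.
    """
    return [f for f in frames if confidence(f) >= threshold]


def mix_frame_into_video(video: Video, frame: Frame, lam: float) -> Video:
    """Blend one still frame into every frame of the video.

    Assumed formula: mixed = lam * video_frame + (1 - lam) * still_frame.
    """
    if any(len(vf) != len(frame) for vf in video):
        raise ValueError("still frame and video frames must have the same size")
    return [
        [lam * v + (1.0 - lam) * s for v, s in zip(vf, frame)]
        for vf in video
    ]


def stillmix_sample(
    video: Video,
    label: int,
    frame_bank: List[Frame],
    rng: random.Random,
    lam_range: Tuple[float, float] = (0.5, 1.0),
) -> Tuple[Video, int]:
    """Build one augmented training sample.

    Assumption: the label of the video is kept unchanged, so the static
    content of the mixed-in frame carries no information about the target.
    """
    frame = rng.choice(frame_bank)
    lam = rng.uniform(*lam_range)
    return mix_frame_into_video(video, frame, lam), label
```

Under these assumptions, the still frame acts as a distractor: because its static content is unrelated to the retained label, a model trained on such samples has less incentive to rely on static cues. In practice the `confidence` callable would wrap a trained 2D image classifier, and the frames would be image tensors rather than flat lists.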