Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model to encode the exact sub-action order of a video, which causes a performance decrease when test videos have a different sub-action order. In this work, we investigate whether we can improve model robustness to sub-action order by shrinking the temporal receptive field of action recognition models. To this end, we design Video BagNet, a variant of the 3D ResNet-50 model whose temporal receptive field is limited to 1, 9, 17, or 33 frames. We analyze Video BagNet on synthetic and real-world video datasets and experimentally compare models with varying temporal receptive fields. We find that models with short temporal receptive fields are robust to sub-action order changes, while models with larger temporal receptive fields are sensitive to the sub-action order.
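To make the notion of a limited temporal receptive field concrete, the sketch below computes the receptive field, in frames, of a stack of temporal convolutions using the standard recurrence for stacked convolutions. The function name and the layer configurations are hypothetical and chosen only for illustration; they are not the Video BagNet architecture, which is not specified here.

```python
from typing import Iterable, Tuple


def temporal_receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Return the temporal receptive field (in frames) of stacked layers.

    Each layer is given as (temporal_kernel_size, temporal_stride).
    Standard recurrence: every layer widens the receptive field by
    (kernel - 1) times the product of the strides of all earlier layers.
    """
    rf = 1    # a single output unit initially sees one frame
    jump = 1  # spacing, in input frames, between adjacent units at this depth
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical configurations (assumptions, not the paper's design):
# temporal kernel size 1 everywhere keeps the model frame-level.
frame_level = [(1, 1)] * 16
# n stride-1 layers with temporal kernel size 3 give a field of 1 + 2n frames.
rf_9 = [(3, 1)] * 4
rf_17 = [(3, 1)] * 8
rf_33 = [(3, 1)] * 16

for name, cfg in [("frame_level", frame_level), ("rf_9", rf_9),
                  ("rf_17", rf_17), ("rf_33", rf_33)]:
    print(name, temporal_receptive_field(cfg))
```

Under these assumptions, replacing temporal kernels of size 3 with kernels of size 1 in some layers is one simple way to shrink the temporal receptive field while keeping the network depth unchanged.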