Transformer models have shown great success in handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated by the high dimensionality that the temporal dimension introduces. While there are surveys analyzing the advances of Transformers for vision, none provides an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends of works leveraging Transformers to model video. Specifically, we first examine how videos are handled at the input level. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition, we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding that they outperform 3D ConvNets even with lower computational complexity.