In this paper, we efficiently transfer the strong representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters. Previous adaptation methods handle spatial and temporal modeling simultaneously with a single unified learnable module, yet they still fall short of fully leveraging the representational capabilities of image transformers. We argue that the popular dual-path (two-stream) architecture in video models can mitigate this problem. We propose DualPath, a novel adaptation method separated into spatial and temporal adaptation paths, with a lightweight bottleneck adapter employed in each transformer block. For temporal dynamic modeling in particular, we arrange consecutive frames into a grid-like frameset to closely imitate the vision transformer's capability of extrapolating relationships between tokens. In addition, we extensively investigate multiple baselines from a unified perspective in video understanding and compare them with DualPath. Experimental results on four action recognition benchmarks demonstrate that pretrained image transformers with DualPath generalize effectively beyond their original data domain.
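The two mechanisms named above, the grid-like frameset and the bottleneck adapter, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `make_grid_frameset` and `bottleneck_adapter`, the row-major frame ordering, the single-channel frames, and the ReLU activation are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List

Frame = List[List[float]]  # one H x W frame (single channel, for brevity)


def make_grid_frameset(frames: List[Frame], rows: int, cols: int) -> Frame:
    """Tile rows*cols consecutive frames into one (rows*H) x (cols*W) image.

    Assumption: frames are placed in row-major order (frame 0 top-left,
    frame 1 to its right, and so on).
    """
    if len(frames) != rows * cols:
        raise ValueError("expected exactly rows * cols frames")
    height = len(frames[0])
    grid: Frame = []
    for r in range(rows):
        row_frames = frames[r * cols:(r + 1) * cols]
        for y in range(height):
            line: List[float] = []
            for frame in row_frames:
                line.extend(frame[y])  # concatenate scanline y of each frame
            grid.append(line)
    return grid


def bottleneck_adapter(
    x: List[float],
    w_down: List[List[float]],  # shape d x r, with r much smaller than d
    w_up: List[List[float]],    # shape r x d
) -> List[float]:
    """Residual bottleneck on one token: x + relu(x @ w_down) @ w_up.

    Assumption: ReLU is used here only as a stand-in nonlinearity.
    """
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, col)))
        for col in zip(*w_down)  # iterate over the r columns of w_down
    ]
    update = [
        sum(h * w for h, w in zip(hidden, col))
        for col in zip(*w_up)  # iterate over the d columns of w_up
    ]
    return [xi + ui for xi, ui in zip(x, update)]
```

In this sketch, only `w_down` and `w_up` would be trainable, which is what keeps the parameter count small; the tiled frameset is an ordinary image, so a frozen image transformer can attend across tokens that originate from different time steps.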