Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks. Given the huge number of parameters and strong transferability of image models, full fine-tuning is inefficient and often unnecessary. Recent research has therefore shifted toward parameter-efficient image-to-video adaptation. However, these adaptation strategies inevitably introduce extra computational cost to deal with the domain gap and temporal modeling in videos. In this paper, we present a new adaptation paradigm (ZeroI2V) that transfers image transformers to video recognition tasks while introducing zero extra cost to the original models during inference. To achieve this goal, we present two core designs. First, to capture the dynamics in videos and reduce the difficulty of image-to-video adaptation, we exploit the flexibility of self-attention and introduce spatial-temporal dual-headed attention (STDHA), which efficiently endows image transformers with temporal modeling capability at zero extra parameters and computation. Second, to handle the domain gap between images and videos, we propose a linear adaptation strategy that uses lightweight, densely placed linear adapters to fully transfer frozen image models to video recognition. Thanks to their customized linear design, all newly added adapters can be merged into the original modules through structural reparameterization after training, enabling zero extra cost during inference. Extensive experiments on representative fully supervised and few-shot video recognition benchmarks show that ZeroI2V matches or even outperforms previous state-of-the-art methods while enjoying superior parameter and inference efficiency.
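The following is a minimal sketch of the STDHA idea as described above: a subset of attention heads is repurposed for temporal modeling by shifting their key/value tokens across frames, so the frozen spatial attention gains a temporal receptive field without any new parameters. The function name `stdha_kv_shift` and the exact shifting scheme are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def stdha_kv_shift(kv, n_temporal_heads, offset=1):
    """Shift key/value tokens of the first n_temporal_heads along time.

    kv: array of shape (T, H, N, D) -- frames, heads, tokens, head dim.
    The shifted heads attend to features from a neighbouring frame,
    giving the frozen spatial attention a temporal receptive field
    with zero extra parameters (illustrative sketch only).
    """
    out = kv.copy()
    # roll only the selected heads by `offset` frames along the time axis;
    # the remaining heads keep doing ordinary spatial attention
    out[:, :n_temporal_heads] = np.roll(kv[:, :n_temporal_heads],
                                        shift=offset, axis=0)
    return out

# toy example: 2 frames, 2 heads, 1 token, 1-dim heads
kv = np.arange(4, dtype=float).reshape(2, 2, 1, 1)
out = stdha_kv_shift(kv, n_temporal_heads=1, offset=1)
```

Because the shift is a pure re-indexing of existing tokens, it adds no FLOPs to the attention computation itself.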
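The zero-cost merging claim rests on the adapters being purely linear. A sketch of the reparameterization, under the assumption that an adapter of the form y = h + hA follows a linear layer h = xW + b (the helper name `merge_linear_adapter` is illustrative):

```python
import numpy as np

def merge_linear_adapter(W, b, A):
    """Fold a purely linear adapter y = h + h @ A into the preceding
    linear layer h = x @ W + b via structural reparameterization.

    Since y = (xW + b)(I + A), the merged layer has weights
    W' = W @ (I + A) and bias b' = b @ (I + A), so the adapter
    costs nothing at inference time.
    """
    M = np.eye(A.shape[0]) + A
    return W @ M, b @ M

# sanity check: merged layer reproduces layer-plus-adapter exactly
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
b = rng.standard_normal(4)
A = 0.01 * rng.standard_normal((4, 4))
x = rng.standard_normal((2, 8))

h = x @ W + b
ref = h + h @ A                 # frozen layer followed by adapter
W2, b2 = merge_linear_adapter(W, b, A)
out = x @ W2 + b2               # single merged layer
```

A nonlinear adapter (e.g. with a GELU between its projections) could not be folded away like this, which is why the linear design is essential to the zero-inference-cost property.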