Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning

Large pre-trained vision models have achieved impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively expensive computationally. Recent studies have therefore turned to efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods pay little attention to training memory usage and do not explore transferring larger models to the video domain. In this paper, we present Side4Video, a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids backpropagation through the heavy pre-trained model and utilizes multi-level spatial features from the original image model. This extremely memory-efficient architecture enables our method to reduce memory usage by 75% compared with previous adapter-based methods. In this way, we can transfer a huge ViT-E (4.4B) to video understanding tasks, a model 14x larger than ViT-L (304M). Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), notably on Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at https://github.com/HJYao00/Side4Video.
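To make the side-network idea concrete, the following is a minimal PyTorch sketch, not the released Side4Video implementation. It assumes a tiny stand-in backbone run under `torch.no_grad()` and a small trainable side network that reads one feature map per backbone layer and mixes information over time with a depthwise 1D convolution. All class names (`FrozenBackbone`, `SideBlock`, `SideNetwork`), layer choices, and sizes are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn


class FrozenBackbone(nn.Module):
    """Tiny stand-in for a large pre-trained image model (hypothetical)."""

    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(depth)
        ])
        for p in self.parameters():
            p.requires_grad_(False)  # backbone weights are never updated

    @torch.no_grad()  # no autograd graph (hence no stored activations) is built here
    def forward(self, x):
        feats = []
        for blk in self.blocks:
            x = x + blk(x)
            feats.append(x)  # keep multi-level features for the side network
        return feats


class SideBlock(nn.Module):
    """One lightweight side layer: fuse a backbone feature, then mix over time."""

    def __init__(self, backbone_dim, side_dim):
        super().__init__()
        self.proj = nn.Linear(backbone_dim, side_dim)  # down-project backbone feature
        self.temporal = nn.Conv1d(side_dim, side_dim, kernel_size=3,
                                  padding=1, groups=side_dim)  # depthwise conv over frames
        self.mlp = nn.Sequential(nn.LayerNorm(side_dim),
                                 nn.Linear(side_dim, 2 * side_dim),
                                 nn.GELU(),
                                 nn.Linear(2 * side_dim, side_dim))

    def forward(self, h, feat):
        # h: (B, T, N, C_side), feat: (B, T, N, C_backbone)
        h = h + self.proj(feat)
        b, t, n, c = h.shape
        z = h.permute(0, 2, 3, 1).reshape(b * n, c, t)                # (B*N, C, T)
        z = self.temporal(z).reshape(b, n, c, t).permute(0, 3, 1, 2)  # back to (B, T, N, C)
        h = h + z
        return h + self.mlp(h)


class SideNetwork(nn.Module):
    def __init__(self, backbone_dim, side_dim, depth, num_classes):
        super().__init__()
        self.side_dim = side_dim
        self.blocks = nn.ModuleList(
            [SideBlock(backbone_dim, side_dim) for _ in range(depth)])
        self.head = nn.Linear(side_dim, num_classes)

    def forward(self, feats):
        b, t, n, _ = feats[0].shape
        h = feats[0].new_zeros(b, t, n, self.side_dim)
        for blk, f in zip(self.blocks, feats):
            h = blk(h, f)
        return self.head(h.mean(dim=(1, 2)))  # average over frames and tokens


torch.manual_seed(0)
B, T, N, D = 2, 8, 16, 64  # batch, frames, tokens per frame, backbone width
backbone = FrozenBackbone(D, depth=4).eval()
side = SideNetwork(D, side_dim=16, depth=4, num_classes=10)

video_tokens = torch.randn(B, T, N, D)
# The image backbone sees each frame independently: fold time into the batch axis.
feats = [f.reshape(B, T, N, D) for f in backbone(video_tokens.reshape(B * T, N, D))]
logits = side(feats)
logits.sum().backward()  # toy loss; gradients reach only the side network
```

Because the backbone features are produced without an autograd graph, the backward pass touches only the side network's parameters. This is the mechanism the abstract credits for the memory savings, although the toy sizes here say nothing about the reported 75% figure.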

