Recently, large-scale pre-trained language-image models like CLIP have shown extraordinary capabilities for understanding spatial content, but naively transferring such models to video recognition still yields unsatisfactory temporal modeling. Existing methods insert tunable structures into, or in parallel with, the pre-trained model. The former requires back-propagation through the whole pre-trained model and is thus resource-demanding, while the latter is limited by the temporal reasoning capability of the pre-trained structure. In this work, we present DiST, which disentangles the learning of spatial and temporal aspects of videos. Specifically, DiST uses a dual-encoder structure, where a pre-trained foundation model acts as the spatial encoder and a lightweight network is introduced as the temporal encoder. An integration branch is inserted between the encoders to fuse spatio-temporal information. The disentangled spatial and temporal learning in DiST is highly efficient because it avoids back-propagating through the massive pre-trained parameters. Meanwhile, we empirically show that disentangled learning with an extra network for integration benefits both spatial and temporal understanding. Extensive experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods by convincing margins. When pre-trained on the large-scale Kinetics-710, DiST achieves 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies its scalability. Code and models are available at https://github.com/alibaba-mmai-research/DiST.
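The dual-encoder layout described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the DiST implementation from the linked repository: the class name `DualEncoderSketch`, all layer choices, and all dimensions are hypothetical. A single frozen linear layer stands in for the pre-trained spatial encoder (for example a CLIP ViT), a 1D convolution over time stands in for the lightweight temporal encoder, and the integration branch is reduced to one fusion layer. The only point it tries to show is that gradients reach the temporal and integration parameters without flowing through the frozen spatial encoder.

```python
import torch
import torch.nn as nn


class DualEncoderSketch(nn.Module):
    """Hypothetical dual-encoder layout; NOT the DiST reference implementation."""

    def __init__(self, channels=3, size=8, dim=64, temporal_dim=16, num_classes=10):
        super().__init__()
        # Assumption: a tiny linear layer stands in for a pre-trained image model.
        self.spatial = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * size * size, dim)
        )
        for p in self.spatial.parameters():
            p.requires_grad = False  # the spatial encoder stays frozen
        # Assumption: a 1D convolution over time as the lightweight temporal encoder.
        self.temporal = nn.Conv1d(channels, temporal_dim, kernel_size=3, padding=1)
        # Assumption: a single linear layer as a simplified integration branch.
        self.integrate = nn.Linear(dim + temporal_dim, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        # Per-frame spatial features; no_grad means no graph is built through
        # the frozen encoder, so nothing is back-propagated through it.
        with torch.no_grad():
            spatial = self.spatial(video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        # Temporal features from spatially pooled frames: (B, T, C) -> (B, C, T).
        pooled = video.mean(dim=(3, 4))
        temporal = self.temporal(pooled.transpose(1, 2)).transpose(1, 2)
        # Fuse the two streams per frame, then average over time.
        fused = torch.relu(self.integrate(torch.cat([spatial, temporal], dim=-1)))
        return self.head(fused.mean(dim=1))


model = DualEncoderSketch()
video = torch.randn(2, 4, 3, 8, 8)  # 2 clips, 4 frames each, 3x8x8 pixels
logits = model(video)
logits.sum().backward()  # gradients reach only the trainable parts
```

After the backward pass, the parameters of `model.spatial` have no gradients, while those of `model.temporal`, `model.integrate`, and `model.head` do. In a real setting the stand-in spatial layer would be replaced by an actual pre-trained backbone, and the integration would likely happen at several depths rather than once.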