Most transformer-based video encoders are limited to short temporal contexts because the cost of self-attention grows quadratically with the number of tokens. While various attempts have been made to extend this context, they have often come at the cost of both conceptual and computational complexity. We propose instead to repurpose existing pre-trained video transformers by simply fine-tuning them to attend to memories derived non-parametrically from past activations. By leveraging redundancy reduction, our memory-consolidated vision transformer (MC-ViT) effortlessly extends its context far into the past and exhibits excellent scaling behavior when learning from longer videos. In doing so, MC-ViT sets a new state of the art in long-context video understanding on EgoSchema, Perception Test, and Diving48, outperforming methods that benefit from orders of magnitude more parameters.
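To make the idea of attending to non-parametric memories concrete, the following is a minimal sketch, not the MC-ViT implementation. It assumes a single attention head with identity query/key/value projections, and it uses k-means as one possible redundancy-reduction step; the abstract does not specify the consolidation method. The names `consolidate`, `attend_with_memory`, `encode_video`, and `memories_per_segment` are hypothetical and chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def consolidate(tokens, num_memories, num_iters=5, seed=0):
    """Compress (n, d) activations into at most num_memories centroids.

    Assumption: k-means stands in for the redundancy-reduction step.
    It is non-parametric in the sense that it has no learned weights.
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    k = min(num_memories, n)
    centroids = tokens[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(num_iters):
        # Squared distance from every token to every centroid: shape (n, k).
        d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids


def attend_with_memory(segment, memory):
    """Queries come from the current segment; keys and values come from
    the consolidated memory concatenated with the current segment."""
    if memory is None:
        context = segment
    else:
        context = np.concatenate([memory, segment], axis=0)
    d = segment.shape[-1]
    scores = segment @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context


def encode_video(segments, memories_per_segment=4):
    """Process a video segment by segment, growing the memory as we go."""
    memory = None
    outputs = []
    for segment in segments:
        outputs.append(attend_with_memory(segment, memory))
        new_memory = consolidate(segment, memories_per_segment)
        if memory is None:
            memory = new_memory
        else:
            memory = np.concatenate([memory, new_memory], axis=0)
    return np.concatenate(outputs, axis=0), memory


# Toy usage: 3 segments of 16 tokens each, feature dimension 8.
rng = np.random.default_rng(0)
video = [rng.normal(size=(16, 8)) for _ in range(3)]
out, mem = encode_video(video, memories_per_segment=4)
print(out.shape, mem.shape)  # (48, 8) (12, 8)
```

In this toy setup each segment of 16 tokens attends to at most 8 memory vectors plus its own 16 tokens, rather than to all 48 tokens of the video, so the per-segment attention cost grows with the size of the compressed memory instead of with the full history.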