Most transformer-based video encoders are limited to short temporal contexts because the cost of self-attention grows quadratically with the number of input tokens. Attempts to extend this context have often added both conceptual and computational complexity. We propose instead to re-purpose existing pre-trained video transformers by simply fine-tuning them to attend to memories derived non-parametrically from past activations. By leveraging redundancy reduction, our memory-consolidated vision transformer (MC-ViT) effortlessly extends its context far into the past and exhibits excellent scaling behavior when learning from longer videos. In doing so, MC-ViT sets a new state of the art in long-context video understanding on EgoSchema, Perception Test, and Diving48, outperforming methods that benefit from orders of magnitude more parameters.
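To make the idea of attending to non-parametrically derived memories concrete, the following is a minimal single-layer sketch in NumPy, not the authors' implementation. It assumes the video is split into fixed-size segments, that each segment's tokens attend to the current tokens plus a memory bank, and that consolidation is done by random subsampling of past activations. The abstract does not specify the consolidation rule, so that choice is an assumption made here for illustration. The names `attend`, `consolidate`, `encode_video`, and `num_memories_per_segment` are hypothetical, and learned query/key/value projections are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    # Scaled dot-product attention without learned projections (simplification).
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values


def consolidate(activations, num_memories, rng):
    # ASSUMPTION: redundancy reduction by random subsampling of past activations.
    # This step has no trainable parameters, i.e. it is non-parametric.
    idx = rng.choice(len(activations), size=num_memories, replace=False)
    return activations[idx]


def encode_video(segments, num_memories_per_segment, seed=0):
    # Process segments in temporal order, carrying a growing memory bank.
    rng = np.random.default_rng(seed)
    dim = segments[0].shape[-1]
    memory = np.empty((0, dim))
    outputs = []
    for tokens in segments:
        # Current tokens attend to consolidated memories plus themselves.
        context = np.concatenate([memory, tokens], axis=0)
        outputs.append(attend(tokens, context, context))
        # Compress this segment's activations and append them to the memory.
        new_memories = consolidate(tokens, num_memories_per_segment, rng)
        memory = np.concatenate([memory, new_memories], axis=0)
    return outputs, memory


if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    # Four hypothetical segments of 16 tokens with 8 channels each.
    segments = [data_rng.normal(size=(16, 8)) for _ in range(4)]
    outputs, memory = encode_video(segments, num_memories_per_segment=4)
    print(len(outputs), outputs[0].shape, memory.shape)
```

In this sketch each segment of N tokens attends to at most M + N keys, where M is the current memory size, so the per-segment attention cost is proportional to N(M + N) rather than to the square of the full video length. Because M grows by only a few tokens per segment, the context can reach far into the past at modest cost.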