Vision Transformers achieve impressive accuracy across a range of visual recognition tasks. Unfortunately, this accuracy often comes at a high computational cost. This is a particular issue in video recognition, where models are often applied repeatedly across frames or temporal chunks. In this work, we exploit temporal redundancy between successive inputs to reduce the cost of Transformers for video processing. We describe a method for identifying and re-processing only those tokens that have changed significantly over time. Our proposed family of models, Eventful Transformers, can be converted from existing Transformers (often without any re-training) and give adaptive control over the compute cost at runtime. We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100). Our approach leads to significant computational savings (on the order of 2-4x) with only minor reductions in accuracy.
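To make the idea of re-processing only changed tokens concrete, the following is a minimal sketch, not the implementation evaluated in this work. It assumes a purely token-wise operation (a hypothetical `token_fn`, standing in for something like a per-token MLP), a fixed L2 threshold as the selection policy, and plain Python lists as tokens. The names `TokenGate`, `process_video`, and `threshold` are illustrative assumptions. Operations that mix tokens, such as self-attention, would need additional handling that is not shown here.

```python
import math


class TokenGate:
    """Hypothetical gate: keeps a reference copy of each token and
    reports which tokens have drifted from it by more than a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.reference = None  # value of each token when it was last processed

    def select(self, tokens):
        # First input: no reference yet, so every token must be processed.
        if self.reference is None:
            self.reference = [list(t) for t in tokens]
            return list(range(len(tokens)))

        changed = []
        for i, (tok, ref) in enumerate(zip(tokens, self.reference)):
            # L2 distance between the current token and its reference.
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(tok, ref)))
            if dist > self.threshold:
                changed.append(i)
                # Update the reference only for tokens that are re-processed,
                # so slow drift accumulates until it crosses the threshold.
                self.reference[i] = list(tok)
        return changed


def process_video(frames, token_fn, threshold):
    """Apply token_fn per frame, recomputing only the gated tokens and
    reusing cached outputs for the rest. Returns the per-frame outputs
    and the total number of token_fn evaluations."""
    gate = TokenGate(threshold)
    cache = []
    outputs = []
    computed = 0
    for tokens in frames:
        indices = gate.select(tokens)
        if not cache:
            cache = [None] * len(tokens)
        for i in indices:
            cache[i] = token_fn(tokens[i])
        computed += len(indices)
        outputs.append(list(cache))
    return outputs, computed
```

In this sketch the threshold acts as the runtime knob: raising it re-processes fewer tokens at the price of staler cached outputs, while a threshold of zero re-processes every token whose value has changed at all. Selecting a fixed number of the most-changed tokens per frame would be an alternative policy under the same structure.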