Temporal modeling plays a crucial role in understanding video content. To tackle this problem, previous studies have built complicated temporal relations across the time sequence, enabled by increasingly powerful computing hardware. In this work, we explore the potential of four simple arithmetic operations for temporal modeling. Specifically, we first capture auxiliary temporal cues by computing addition, subtraction, multiplication, and division between pairs of extracted frame features. Then, we extract corresponding features from these cues to benefit the original temporal-irrespective domain. We term this simple pipeline the Arithmetic Temporal Module (ATM), which operates on the stem of a visual backbone in a plug-and-play manner. We conduct comprehensive ablation studies on the instantiation of ATMs and demonstrate that this module provides powerful temporal modeling capability at a low computational cost. Moreover, the ATM is compatible with both CNN- and ViT-based architectures. Our results show that ATM achieves superior performance on several popular video benchmarks. Specifically, on Something-Something V1, Something-Something V2, and Kinetics-400, we reach top-1 accuracies of 65.6%, 74.6%, and 89.4%, respectively. The code is available at https://github.com/whwu95/ATM.
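As a rough illustration of the first step described in the abstract (computing arithmetic cues between pairs of frame features), the sketch below applies the four operations element-wise to adjacent frames. It is not the authors' implementation: the function name `arithmetic_cues`, the choice of adjacent-frame pairing, the operand order (later frame with respect to earlier frame), and the `eps` guard in the division are all assumptions made for this example. The second step, extracting features from these cues and feeding them back to the backbone, is not shown.

```python
from operator import add, sub, mul


def arithmetic_cues(features, eps=1e-6):
    """Element-wise arithmetic cues between adjacent frame feature vectors.

    features: list of T equal-length lists of floats, one per frame.
    Returns a dict with keys "add", "sub", "mul", "div"; each value holds
    T-1 vectors. Adjacent pairing and the eps guard are assumptions made
    for illustration, not details taken from the paper.
    """
    def safe_div(a, b):
        # Assumed guard: push the denominator away from zero, keeping its sign.
        return a / (b + eps if b >= 0 else b - eps)

    ops = {"add": add, "sub": sub, "mul": mul, "div": safe_div}
    pairs = list(zip(features[:-1], features[1:]))  # (frame t, frame t+1)
    return {
        # Each cue is computed as op(frame t+1, frame t), element by element.
        name: [[op(b, a) for a, b in zip(prev, nxt)] for prev, nxt in pairs]
        for name, op in ops.items()
    }


# Toy input: T = 3 frames, feature dimension 2.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 8.0]]
cues = arithmetic_cues(frames)
```

In this toy setting, each cue type yields one vector per adjacent frame pair, so three frames produce two vectors per operation. In a real model the same idea would more likely be applied to batched feature tensors inside the network, which this plain-Python sketch does not attempt to reproduce.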