Given an untrimmed video, repetitive action counting aims to estimate the number of repetitions of class-agnostic actions. To handle the varying lengths of videos and repetitive actions, as well as the optimization challenges of end-to-end video model training, recent state-of-the-art methods commonly down-sample the input, which causes some repetitions to be missed. In this paper, we attempt to understand repetitive actions from a full temporal resolution view by combining offline feature extraction with temporal convolution networks. The former enables us to train a repetition counting network without down-sampling, preserving all repetitions regardless of video length and action frequency. The latter models all frames with a flexible, dynamically expanding temporal receptive field, so that all repetitions can be retrieved from a global perspective. We experimentally demonstrate that our method achieves better or comparable performance on three public datasets, i.e., TransRAC, UCFRep and QUVA. We hope this work will encourage the community to consider the importance of full temporal resolution.
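To illustrate how a temporal convolution network can cover every frame while its receptive field grows with depth, the following is a minimal sketch that assumes stacked dilated 1D convolutions with stride 1. The abstract does not state the kernel size, the dilation schedule, or the layer count, so the values used here (kernel size 3, dilation doubling per layer) and the function names `receptive_field` and `dilated_conv1d` are hypothetical and are not the authors' implementation.

```python
def receptive_field(kernel_size, dilations):
    """Number of input frames seen by one output position after
    stacking stride-1 dilated 1D convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


def dilated_conv1d(signal, weights, dilation):
    """Same-length dilated 1D convolution with zero padding and stride 1.

    `signal` stands in for one channel of pre-extracted per-frame features.
    The output has one value per input frame, so no frame is dropped.
    """
    center = len(weights) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t + (j - center) * dilation
            if 0 <= idx < len(signal):  # positions outside the clip count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


# Assumed schedule: dilation doubles at every layer (1, 2, 4, ...).
dilations = [2 ** i for i in range(10)]
for depth in (1, 4, 10):
    print(depth, receptive_field(3, dilations[:depth]))

# The temporal length is preserved: 5 frames in, 5 values out.
print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=2))
```

Under these assumptions the receptive field grows exponentially with depth while the sequence length stays equal to the number of frames, which is the property that lets such a network avoid temporal down-sampling.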