Temporal repetition counting aims to quantify the repeated action cycles within a video. Most existing methods rely on a similarity correlation matrix to characterize the repetitiveness of actions, but their scalability is limited by the matrix's quadratic computational complexity. In this work, we introduce a novel approach that employs an action query representation to localize repeated action cycles with linear computational complexity. Based on this representation, we further develop two key components to tackle the essential challenges of temporal repetition counting. First, to facilitate open-set action counting, we propose a dynamic update scheme for action queries. Unlike static action queries, this scheme dynamically embeds video features into the action queries, offering a more flexible and generalizable representation. Second, to distinguish actions of interest from background noise actions, we incorporate inter-query contrastive learning to regularize the video representations corresponding to different action queries. As a result, our method significantly outperforms previous works, particularly on long video sequences, unseen actions, and actions performed at various speeds. On the challenging RepCountA benchmark, we outperform the state-of-the-art method TransRAC by 26.5% in OBO accuracy, while reducing mean error by 22.7% and computational burden by 94.1%. Code is available at https://github.com/lizishi/DeTRC.
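The abstract reports OBO accuracy and mean error without defining them. The sketch below follows the definitions commonly used in the repetition-counting literature, which we assume apply here: OBO (off-by-one) accuracy is the fraction of videos whose predicted count is within one of the ground truth, and mean error is the absolute count error normalized by the ground-truth count, averaged over videos. The function names `obo_accuracy` and `mean_error` and the sample counts are hypothetical and are not taken from the DeTRC code base.

```python
from typing import Sequence


def obo_accuracy(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Fraction of videos with |predicted count - true count| <= 1."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    hits = sum(1 for p, g in zip(pred, gt) if abs(p - g) <= 1)
    return hits / len(gt)


def mean_error(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Average of |predicted - true| / true over all videos.

    Assumes every ground-truth count is positive; how zero-count
    videos are handled is not specified in the abstract.
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    if any(g <= 0 for g in gt):
        raise ValueError("ground-truth counts must be positive")
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


# Made-up counts for three videos, for illustration only.
predicted = [4, 10, 7]
actual = [5, 10, 10]
print(f"OBO accuracy: {obo_accuracy(predicted, actual):.3f}")  # 0.667
print(f"Mean error:   {mean_error(predicted, actual):.3f}")    # 0.167
```

Under these definitions, higher OBO accuracy and lower mean error are better, which matches the direction of the improvements reported above.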