We introduce a dataset of annotations of temporal repetitions in videos. The dataset, OVR (pronounced as over), contains annotations for over 72K videos, with each annotation specifying the number of repetitions, their start and end times, and a free-form description of what is repeating. The annotations are provided for videos sourced from Kinetics and Ego4D, and consequently cover both Exo and Ego viewing conditions, with a wide variety of actions and activities. Moreover, OVR is almost an order of magnitude larger than previous datasets for video repetition. We also propose a baseline transformer-based counting model, OVRCounter, that can localise and count repetitions in videos up to 320 frames long. The model is trained and evaluated on the OVR dataset, and its performance is assessed with and without using text to specify the target class to count. The performance is also compared to a prior repetition counting model. The dataset is available for download at: https://sites.google.com/view/openvocabreps/