The vocabulary size in temporal action localization (TAL) is constrained by the scarcity of large-scale annotated datasets. To address this, recent works incorporate powerful pre-trained vision-language models (VLMs), such as CLIP, to perform open-vocabulary TAL (OV-TAL). However, unlike VLMs trained on extensive image/video-text pairs, existing OV-TAL methods still rely on small, fully labeled TAL datasets to train an action localizer. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our self-training approach consists of two stages. First, a class-agnostic action localizer is trained on a human-labeled TAL dataset and used to generate pseudo-labels for unlabeled videos. Second, the large-scale pseudo-labeled dataset is combined with the human-labeled dataset to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly enhances the generalizability of an action localizer. Additionally, we highlight issues with existing OV-TAL evaluation schemes and propose a new evaluation protocol. Code is released at https://github.com/HYUNJS/STOV-TAL