Video understanding has long suffered from reliance on large labeled datasets, motivating research into zero-shot learning. Recent progress in language modeling presents opportunities to advance zero-shot video analysis, but constructing an effective semantic space relating action classes remains challenging. We address this by introducing a novel dataset, Stories, which contains rich textual descriptions for diverse action classes extracted from WikiHow articles. For each class, we extract multi-sentence narratives detailing the necessary steps, scenes, objects, and verbs that characterize the action. This contextual data enables modeling of nuanced relationships between actions, paving the way for zero-shot transfer. We also propose an approach that harnesses Stories to improve feature generation for training zero-shot classifiers. Without any fine-tuning on the target datasets, our method achieves a new state of the art on multiple benchmarks, improving top-1 accuracy by up to 6.1%. We believe Stories provides a valuable resource that can catalyze progress in zero-shot action recognition. The textual narratives forge connections between seen and unseen classes, overcoming the labeled-data bottleneck that has long impeded progress in this domain. The data is available at https://github.com/kini5gowda/Stories.