Open-Vocabulary Temporal Action Detection (OV-TAD) aims to classify and localize action segments of unseen categories in untrimmed videos. Previous methods rely solely on global alignment between label-level semantics and visual features, which is insufficient to transfer temporally consistent visual knowledge from seen to unseen classes. To address this, we propose a Phase-wise Decomposition and Alignment (PDA) framework, which enables fine-grained action pattern learning for effective prior knowledge transfer. Specifically, we first introduce the CoT-Prompting Semantic Decomposition (CSD) module, which leverages the chain-of-thought (CoT) reasoning ability of large language models to automatically decompose action labels into coherent phase-level descriptions, emulating human cognitive processes. Then, a Text-infused Foreground Filtering (TIF) module adaptively filters action-relevant segments for each phase using phase-wise semantic cues, producing semantically aligned visual representations. Furthermore, we propose the Adaptive Phase-wise Alignment (APA) module, which performs phase-level visual-textual matching and adaptively aggregates the alignment results across phases for the final prediction. This adaptive phase-wise alignment facilitates the capture of transferable action patterns and significantly enhances generalization to unseen actions. Extensive experiments on two OV-TAD benchmarks demonstrate the superiority of the proposed method.
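To make the idea of phase-level matching followed by adaptive aggregation concrete, here is a minimal sketch. It is not the PDA implementation: the abstract does not specify how APA computes or aggregates alignment scores, so the cosine similarity, the softmax weighting over phases, and all names (`phase_wise_score`, `classify`, `temperature`) are assumptions chosen for illustration. Each class is assumed to come with one text embedding per phase, and each candidate segment with one visual feature per phase.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(x - peak) for x in values]
    total = sum(exps)
    return [e / total for e in exps]


def phase_wise_score(visual_phases, text_phases, temperature=1.0):
    """Align each phase separately, then aggregate with softmax weights.

    Assumption: the weights are derived from the similarities themselves;
    a learned weighting would be an equally plausible choice.
    """
    if len(visual_phases) != len(text_phases):
        raise ValueError("visual and text inputs must have the same number of phases")
    sims = [cosine(v, t) for v, t in zip(visual_phases, text_phases)]
    weights = softmax([s / temperature for s in sims])
    return sum(w * s for w, s in zip(weights, sims))


def classify(visual_phases, class_to_text_phases):
    """Score every class (seen or unseen) and return the best label with all scores."""
    scores = {
        label: phase_wise_score(visual_phases, text_phases)
        for label, text_phases in class_to_text_phases.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy example with two phases and two-dimensional embeddings (made-up values).
segment = [[1.0, 0.0], [0.0, 1.0]]
classes = {
    "class_a": [[1.0, 0.0], [0.0, 1.0]],  # phases match the segment in order
    "class_b": [[0.0, 1.0], [1.0, 0.0]],  # same phases, wrong order
}
label, scores = classify(segment, classes)
```

In this toy case the two classes share the same set of phase embeddings and differ only in their order, which a single pooled (global) comparison could not distinguish but a per-phase comparison can.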