Weakly-supervised temporal action localization (WTAL) is a practical yet challenging task. Because the datasets are large-scale, most existing methods extract features with a network pretrained on other datasets, and these features are not well suited to WTAL. To address this problem, researchers have designed several feature-enhancement modules, in particular ones that model the temporal relationships between snippets, to improve the performance of the localization module. However, all of them neglect the adverse effects of ambiguous information, which reduces the discriminability of other snippets. To address this, we propose the Discriminability-Driven Graph Network (DDG-Net), which explicitly models ambiguous snippets and discriminative snippets with well-designed connections, preventing the transmission of ambiguous information and enhancing the discriminability of snippet-level representations. Additionally, we propose a feature consistency loss to prevent the assimilation of features and to drive the graph convolution network to generate more discriminative representations. Extensive experiments on the THUMOS14 and ActivityNet1.2 benchmarks demonstrate the effectiveness of DDG-Net, which establishes new state-of-the-art results on both datasets. Source code is available at \url{https://github.com/XiaojunTang22/ICCV2023-DDGNet}.
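The idea of connecting snippets so that ambiguous information is not transmitted can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it is a minimal NumPy illustration under assumptions of our own: snippets are split by hypothetical score thresholds `low` and `high`, edge weights are clipped cosine similarities, and the function names `split_snippets`, `build_adjacency`, and `propagate` are invented for this example. Discriminative snippets exchange information only with snippets of the same type, while ambiguous snippets may receive from discriminative ones but never send.

```python
import numpy as np


def split_snippets(scores, low=0.3, high=0.7):
    """Label snippets: 1 = pseudo-action, 0 = pseudo-background, -1 = ambiguous.

    The thresholds are hypothetical values chosen for illustration only.
    """
    labels = np.full(scores.shape, -1, dtype=int)
    labels[scores >= high] = 1
    labels[scores <= low] = 0
    return labels


def build_adjacency(features, labels):
    """Build a row-normalized adjacency where adj[i, j] weights what i receives from j."""
    # Non-negative cosine similarity between all pairs of snippets.
    norm = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norm, 1e-8, None)
    sim = np.clip(unit @ unit.T, 0.0, None)

    sender_is_discriminative = (labels >= 0)[None, :]
    same_type = labels[:, None] == labels[None, :]
    receiver_is_ambiguous = (labels == -1)[:, None]

    # An edge j -> i is kept only if j is discriminative and
    # i is either of the same type as j or ambiguous.
    allowed = sender_is_discriminative & (same_type | receiver_is_ambiguous)
    adj = sim * allowed

    # Self-loops let every snippet keep its own feature (and avoid empty rows).
    np.fill_diagonal(adj, 1.0)
    adj /= adj.sum(axis=1, keepdims=True)
    return adj


def propagate(features, adj):
    """One step of graph aggregation: each snippet averages its allowed neighbours."""
    return adj @ features


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.random((5, 8))                      # 5 snippets, 8-dim features
    scores = np.array([0.9, 0.8, 0.5, 0.1, 0.2])    # per-snippet action scores
    labels = split_snippets(scores)
    enhanced = propagate(feats, build_adjacency(feats, labels))
    print(labels, enhanced.shape)
```

In this sketch the column of the adjacency matrix belonging to an ambiguous snippet is zero everywhere except on the diagonal, which is one simple way to express "ambiguous snippets do not transmit information" as a graph constraint.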