Weakly-supervised temporal action localization (WTAL) is a practical yet challenging task. Because the datasets are large-scale, most existing methods extract features with a network pretrained on other datasets, and these features are not well suited to WTAL. To address this problem, researchers have designed several feature-enhancement modules that improve the performance of the localization module, especially by modeling the temporal relationships between snippets. However, all of them neglect the adverse effect of ambiguous information, which reduces the discriminability of other snippets. Considering this phenomenon, we propose the Discriminability-Driven Graph Network (DDG-Net), which explicitly models ambiguous snippets and discriminative snippets with well-designed connections, preventing the transmission of ambiguous information and enhancing the discriminability of snippet-level representations. Additionally, we propose a feature consistency loss to prevent the assimilation of features and to drive the graph convolution network to generate more discriminative representations. Extensive experiments on the THUMOS14 and ActivityNet1.2 benchmarks demonstrate the effectiveness of DDG-Net, which establishes new state-of-the-art results on both datasets. Source code is available at \url{https://github.com/XiaojunTang22/ICCV2023-DDGNet}.