Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual, and audio-visual event instances and to identify their event categories, using only video-level category labels for training. Most previous methods focus on refining the supervision for each modality or on extracting rich cross-modality information for more reliable feature learning. None of them addresses the imbalanced feature learning between modalities in this task. In this paper, to balance the feature learning processes of the different modalities, we explore a dynamic gradient modulation (DGM) mechanism, in which a novel metric function is designed to measure the imbalance in feature learning between the audio and visual modalities. Furthermore, our analysis indicates that the multimodal confusing calculation hampers precise measurement of this imbalance, which in turn weakens the effectiveness of the DGM mechanism. To cope with this issue, we design a modality-separated decision unit (MSDU) that measures the imbalanced feature learning between the audio and visual modalities more precisely. Comprehensive experiments on public benchmarks demonstrate the effectiveness of the proposed method.
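The abstract does not give the form of the metric function or of the modulation rule. As a rough illustration of the general idea behind gradient modulation, the sketch below assumes that the imbalance is measured as a simple ratio of per-modality confidence scores and that the dominant modality's gradients are damped with a tanh-based coefficient. The names `modulation_coefficients`, `modulate`, and `alpha`, as well as the ratio and tanh forms, are hypothetical choices made for this example and are not taken from the paper.

```python
import math


def modulation_coefficients(audio_score, visual_score, alpha=1.0):
    """Return (k_audio, k_visual), gradient scaling factors in (0, 1].

    ASSUMPTION: the imbalance metric is the ratio of two positive
    per-modality confidence scores. The paper's actual metric function
    is not specified in the abstract.
    """
    if audio_score <= 0 or visual_score <= 0:
        raise ValueError("scores must be positive")

    ratio = audio_score / visual_score
    if ratio > 1.0:
        # Audio is learning faster: damp its gradients, leave visual untouched.
        return 1.0 - math.tanh(alpha * (ratio - 1.0)), 1.0
    if ratio < 1.0:
        # Visual is learning faster: damp its gradients, leave audio untouched.
        return 1.0, 1.0 - math.tanh(alpha * (1.0 / ratio - 1.0))
    # Balanced: no modulation.
    return 1.0, 1.0


def modulate(grads, k):
    """Scale a flat list of gradient values by the coefficient k."""
    return [k * g for g in grads]


if __name__ == "__main__":
    k_a, k_v = modulation_coefficients(audio_score=0.8, visual_score=0.4)
    print("audio coefficient:", k_a, "visual coefficient:", k_v)
    print("modulated audio gradients:", modulate([0.3, -0.1], k_a))
```

In a training loop, coefficients of this kind would be recomputed at every step and applied to each modality encoder's gradients before the optimizer update, so that the faster-learning modality is slowed while the other catches up.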