Audio-visual Zero-Shot Learning (ZSL) has attracted significant attention for its ability to identify unseen classes and perform well in video classification tasks. However, modal imbalance in (generalized) ZSL leads to over-reliance on the optimal modality, reducing discriminative capability for unseen classes. Some studies have attempted to address this issue by modifying parameter gradients, but two challenges remain: (a) quality discrepancies, where modalities offer differing quantities and qualities of information for the same concept, and (b) content discrepancies, where sample contributions within a modality vary significantly. To address these challenges, we propose a Discrepancy-Aware Attention Network (DAAN) for enhanced audio-visual ZSL. Our approach introduces a Quality-Discrepancy Mitigation Attention (QDMA) unit to minimize redundant information in the high-quality modality and a Contrastive Sample-level Gradient Modulation (CSGM) block to adjust gradient magnitudes and balance content discrepancies. We quantify modality contributions by integrating optimization and convergence rate, enabling more precise gradient modulation in CSGM. Experiments demonstrate that DAAN achieves state-of-the-art performance on benchmark datasets, and ablation studies validate the effectiveness of the individual modules.