Audio-visual Generalized Zero-shot Learning (AV-GZSL) is a challenging task that aims to classify both seen and unseen objects or scenes by integrating data from the audio and visual modalities. Recent studies primarily focus on fusing or aligning audio and visual features to generate more informative audio-visual embeddings. Moreover, most existing methods align audio-visual and textual features solely through their optimization objectives, and thus neglect the inherent distributional and structural differences between the audio-visual and textual modalities. To address this limitation, we propose a method termed Aligning Hierarchical Standardized Embedding (AHSE), which hierarchically aligns standardized audio-visual and textual embeddings within a shared embedding space. Specifically, we first apply Z-score standardization to the fused audio-visual embeddings and the textual embeddings to reduce distributional mismatches. We then introduce a hierarchical alignment strategy that minimizes discrepancies at the semantic, class, and batch levels, thereby constructing a more robust and well-structured embedding space. This strategy not only preserves semantic and inter-class relationships but also maintains spatial consistency within each batch. Extensive experiments on three benchmark datasets (VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL) demonstrate that AHSE achieves competitive performance in zero-shot learning.
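The Z-score standardization step mentioned above can be illustrated with a minimal sketch. The abstract does not say along which axis the statistics are computed, so this example assumes each embedding vector is standardized across its own feature dimensions; the function name `zscore`, the `eps` term, and the toy values are hypothetical and are not taken from AHSE itself.

```python
import math


def zscore(vec, eps=1e-8):
    """Standardize one embedding to zero mean and unit variance.

    Assumption: statistics are taken across the feature dimensions of a
    single embedding; the source does not specify the axis.
    """
    n = len(vec)
    mean = sum(vec) / n
    # Population variance (divide by n, not n - 1).
    var = sum((x - mean) ** 2 for x in vec) / n
    std = math.sqrt(var)
    # eps guards against division by zero for constant vectors.
    return [(x - mean) / (std + eps) for x in vec]


# Toy embeddings on very different scales (hypothetical values).
audio_visual = [10.0, 20.0, 30.0, 40.0]
text = [0.1, 0.2, 0.3, 0.4]

av_std = zscore(audio_visual)
text_std = zscore(text)
```

In this toy case the two inputs differ only by a constant scale factor, so their standardized versions nearly coincide. This shows how standardization can remove a scale mismatch between modalities before any alignment loss is applied.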