Asymmetric Hierarchical Anchoring for Audio-Visual Joint Representation: Resolving Information Allocation Ambiguity for Robust Cross-Modal Generalization


Bixing Wu, Yuhong Zhao, Zongli Ye, Jiachen Lian, Xiangyu Yue, Gopala Anumanchipalli

From arXiv, 18 pages, 11 figures

Audio-visual joint representation learning under Cross-Modal Generalization (CMG) aims to transfer knowledge from a labeled source modality to an unlabeled target modality through a unified discrete representation space. Existing symmetric frameworks often suffer from information allocation ambiguity: without a structural inductive bias, semantic-specific information leaks across modalities. We propose Asymmetric Hierarchical Anchoring (AHA), which enforces directional information allocation by designating a structured semantic anchor within a shared hierarchy. In our instantiation, we exploit the hierarchical discrete representations induced by audio Residual Vector Quantization (RVQ) to guide video feature distillation into a shared semantic space. To ensure representational purity, we replace fragile mutual information estimators with an adversarial decoupler based on a gradient reversal layer (GRL) that explicitly suppresses semantic leakage in the modality-specific branches, and we introduce Local Sliding Alignment (LSA) to encourage fine-grained temporal alignment across modalities. Extensive experiments on the AVE and AVVP benchmarks demonstrate that AHA consistently outperforms symmetric baselines in cross-modal transfer. Additional analyses on a talking-face disentanglement experiment further validate that the learned representations exhibit improved semantic consistency and disentanglement, indicating the broader applicability of the proposed framework.
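The abstract relies on the hierarchy that Residual Vector Quantization induces: each level quantizes what the previous levels left unexplained, so early codes carry coarse structure and later codes carry finer detail. The following is a minimal sketch of that general RVQ idea only, not the authors' implementation. The function names (`rvq_encode`, `rvq_decode`), the two-dimensional input, and the toy codebook values are all hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Codebook = Sequence[Sequence[float]]


def nearest(codebook: Codebook, x: Sequence[float]) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def sq_dist(c: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x: Sequence[float], codebooks: Sequence[Codebook]) -> Tuple[List[int], Vector]:
    """Quantize x level by level; each level encodes the residual left by the previous one."""
    residual = list(x)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next level sees only what is still unexplained.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Reconstruct an approximation of the input by summing the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: level 0 is coarse, level 1 refines the residual.
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    [[0.25, 0.25], [0.25, -0.25], [-0.25, 0.25], [-0.25, -0.25]],
]

x = [0.8, 0.3]
codes, residual = rvq_encode(x, CODEBOOKS)
approx = rvq_decode(codes, CODEBOOKS)
print(codes, approx, residual)
```

In this toy setting the first code alone gives a coarse approximation of the input and the second code refines it. Under the assumption that coarse levels capture mostly shared semantics, this ordering is the kind of structure the abstract describes using as an anchor for the video branch.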
