Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs

Learning robust audio-visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences such as background noise, unrelated elements, or unannotated events. Most contrastive and triplet-loss methods use sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" might also contain the sound and sight of a motorcycle. Because "motorcycle" is not the chosen annotation, standard methods treat these co-occurrences as negatives to true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio-Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce aligned soft-label distributions across modalities, assigning nonzero probability to co-occurring but unannotated events and enriching the supervision signal. (2) Inferred Latent Interaction Graph (ILI) applies the GRaSP algorithm to the teacher's soft labels to infer a sparse, directed dependency graph among classes. This graph highlights directional dependencies (e.g., "Train (visual)" -> "Motorcycle (audio)") that expose likely semantic or conditional relationships between classes; these are interpreted as estimated dependency patterns. (3) Latent Interaction Regularizer (LIR) trains a student network with both a metric loss and a regularizer guided by the ILI graph, pulling together embeddings of dependency-linked but unlabeled pairs in proportion to their soft-label probabilities. Experiments on the AVE and VEGAS benchmarks show consistent improvements in mean average precision (mAP), demonstrating that integrating inferred latent interactions into embedding learning enhances robustness and semantic coherence.
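
To make step (3) concrete, here is a minimal NumPy sketch of one way a graph-guided regularizer of this kind could be written. It is not the paper's exact formulation, which the abstract does not give. The function name `latent_interaction_regularizer`, the squared Euclidean distance, the mean over pairs, and the convention that `adjacency[i, j] = 1` means an inferred edge from visual class i to audio class j are all assumptions made for illustration. The graph is taken as given; the GRaSP step is not reproduced here.

```python
import numpy as np

def latent_interaction_regularizer(visual_emb, audio_emb,
                                   visual_soft, audio_soft,
                                   labels, adjacency):
    """Hypothetical LIR-style penalty (illustrative assumption, not the paper's formula).

    visual_emb, audio_emb   : (N, D) student embeddings for N clips
    visual_soft, audio_soft : (N, C) teacher soft-label distributions
    labels                  : (N,)   annotated class index per clip
    adjacency               : (C, C) assumed graph, adjacency[i, j] = 1 for an
                              edge "visual class i -> audio class j"
    """
    # Pair weight: soft-label mass that flows along graph edges from the
    # visual side of clip a to the audio side of clip b.
    weights = visual_soft @ adjacency @ audio_soft.T            # (N, N)

    # Only pairs that do NOT share an annotated label are regularized here;
    # same-label pairs are assumed to be handled by the metric loss.
    labels = np.asarray(labels)
    unlabeled_pair = labels[:, None] != labels[None, :]         # (N, N) bool

    # Squared Euclidean distance between every visual/audio embedding pair.
    diff = visual_emb[:, None, :] - audio_emb[None, :, :]       # (N, N, D)
    sq_dist = (diff ** 2).sum(axis=-1)                          # (N, N)

    # Average weighted distance over the differently labeled pairs.
    n_pairs = max(int(unlabeled_pair.sum()), 1)
    return float((weights * unlabeled_pair * sq_dist).sum() / n_pairs)


# Tiny usage example with two clips and two classes (toy values).
visual_emb = np.array([[0.0, 0.0], [1.0, 0.0]])
audio_emb = np.array([[0.0, 0.0], [0.0, 1.0]])
one_hot = np.eye(2)                          # stand-in for teacher soft labels
labels = np.array([0, 1])
adjacency = np.array([[0.0, 1.0],            # visual class 0 -> audio class 1
                      [0.0, 0.0]])

loss = latent_interaction_regularizer(visual_emb, audio_emb,
                                      one_hot, one_hot, labels, adjacency)
loss_no_edges = latent_interaction_regularizer(visual_emb, audio_emb,
                                               one_hot, one_hot, labels,
                                               np.zeros((2, 2)))
```

In this sketch the penalty vanishes when the graph has no edges, so the student falls back to the metric loss alone. In training, the value would be added to the metric loss with a weighting coefficient, which is likewise an assumption here.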
