Audio Representation Learning by Distilling Video as Privileged Information

Deep audio representation learning using multi-modal audio-visual data often leads to better performance than uni-modal approaches. However, in real-world scenarios the two modalities are not always both available at inference time, which degrades the performance of models trained for multi-modal inference. In this work, we propose a novel approach for deep audio representation learning using audio-visual data when the video modality is absent at inference. For this purpose, we adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI). While previous LUPI methods use soft labels generated by the teacher, our method uses embeddings learned by the teacher to train the student network. We integrate our method in two different settings: sequential data, where the features are divided into multiple segments over time, and non-sequential data, where the entire feature set is treated as a single segment. In the non-sequential setting, both the teacher and student networks consist of an encoder component and a task header. We use the embeddings produced by the encoder component of the teacher to train the encoder of the student, while the task header of the student is trained using ground-truth labels. In the sequential setting, the networks have an additional aggregation component placed between the encoder and the task header. We use two sets of embeddings, produced by the encoder and by the aggregation component of the teacher, to train the student. As in the non-sequential setting, the task header of the student network is trained using ground-truth labels. We test our framework on two audio-visual tasks, namely speaker recognition and speech emotion recognition, and show considerable improvements over audio-only recognition as well as prior works that use LUPI.
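
The training signal described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which distance is used between teacher and student embeddings, nor how the terms are weighted or combined, so the mean squared error, the weights `alpha` and `beta`, the single summed objective, and all function names below are assumptions made for illustration only.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between logits of shape (N, C) and integer labels (N,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def embedding_distance(student_emb, teacher_emb):
    """Mean squared error between same-shaped embeddings (assumed distance)."""
    return float(np.mean((student_emb - teacher_emb) ** 2))


def student_loss(logits, labels, student_enc, teacher_enc,
                 student_agg=None, teacher_agg=None, alpha=1.0, beta=1.0):
    """Hypothetical combined objective for the audio-only student.

    - The task header is supervised by ground-truth labels (cross-entropy).
    - The student encoder is pulled toward the teacher's encoder embeddings.
    - In the sequential setting, the aggregated embeddings are matched as well.
    alpha and beta are assumed weights, not values taken from the paper.
    """
    loss = softmax_cross_entropy(logits, labels)
    loss += alpha * embedding_distance(student_enc, teacher_enc)
    if student_agg is not None and teacher_agg is not None:
        loss += beta * embedding_distance(student_agg, teacher_agg)
    return loss


# Sequential setting with toy shapes: 2 clips, 5 segments, 4-dim embeddings, 3 classes.
rng = np.random.default_rng(0)
labels = np.array([0, 2])
logits = rng.normal(size=(2, 3))          # student task-header outputs
student_enc = rng.normal(size=(2, 5, 4))  # per-segment student embeddings
teacher_enc = rng.normal(size=(2, 5, 4))  # per-segment teacher embeddings
student_agg = rng.normal(size=(2, 4))     # aggregated student embeddings
teacher_agg = rng.normal(size=(2, 4))     # aggregated teacher embeddings
print(student_loss(logits, labels, student_enc, teacher_enc,
                   student_agg, teacher_agg))
```

Omitting `student_agg` and `teacher_agg` gives the non-sequential case, where only the encoder embeddings are matched. The teacher's embeddings are treated as fixed targets here, so only the audio-only student would be updated and the video input is never needed at inference.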
