Speech representation learning with self-supervised algorithms has resulted in notable performance boosts in many downstream tasks. Recent work has combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms for representation learning. Joint training with SSL and VGS mechanisms makes it possible to use both unlabeled speech and speech-related visual information, depending on data availability. This has been shown to enhance the quality of the learned representations, especially in encoding semantic- and lexical-level knowledge. In this work, we further study the joint optimization of wav2vec 2.0-based SSL and transformer-based VGS as a multi-task learning system. We explore a set of training scenarios to understand how speech representations are shared or transferred between the two tasks, and which training strategy is optimal for cross-modal semantic retrieval and phoneme discrimination performance. We find that sequential training, with wav2vec 2.0 first and VGS next, provides higher performance on audio-visual retrieval than simultaneous optimization of both learning mechanisms. However, parallel SSL-VGS training reduces the effects of catastrophic forgetting when switching between optimization criteria. Moreover, the results suggest that phonemic representations learned through the VGS mechanism may generalize better across datasets than those learned with SSL.