Self-supervised learning (SSL) based speech pre-training has attracted much attention for its ability to extract rich representations from massive unlabeled data. The use of weakly-supervised data, by contrast, remains less explored for speech pre-training. To fill this gap, we propose a weakly-supervised speech pre-training method based on speaker-aware speech data. It follows a training procedure similar to the widely used masked speech prediction based SSL framework, while incorporating target-speaker enrollment information as an auxiliary input. In this way, the learned representation is steered towards the target speaker even in the presence of highly overlapping interference, enabling potential applications to tasks such as target speech recognition. Our experiments on the Libri2Mix and WSJ0-2mix datasets show that the proposed model achieves significantly better ASR performance than WavLM, the state-of-the-art SSL model with denoising capability.