Self-supervised learning (SSL) based speech pre-training has attracted much attention for its ability to extract rich representations from massive unlabeled data. The use of weakly-supervised data for speech pre-training, by contrast, is less explored. To fill this gap, we propose a weakly-supervised speech pre-training method based on speaker-aware speech data. It follows a training procedure similar to the widely used masked speech prediction based SSL framework, while incorporating target-speaker enrollment information as an auxiliary input. In this way, the learned representation is steered towards the target speaker even in the presence of highly overlapping interference, enabling applications such as target speech recognition. Our experiments on the Libri2Mix and WSJ0-2mix datasets show that the proposed model achieves significantly better ASR performance than WavLM, the state-of-the-art SSL model with denoising capability.