Most speech self-supervised learning (SSL) models are trained with a pretext task that consists of predicting missing parts of the input signal, either future segments (causal prediction) or segments masked anywhere within the input (non-causal prediction). The learned speech representations can then be efficiently transferred to downstream tasks (e.g., automatic speech or speaker recognition). In the present study, we investigate the use of a speech SSL model for speech inpainting, i.e., reconstructing a missing portion of a speech signal from its surrounding context, a downstream task that is very similar to the pretext task. To that end, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, acting as a decoder. In particular, we propose two solutions to match the HuBERT output with the HiFiGAN input: freezing one model and fine-tuning the other, and vice versa. The performance of both approaches was assessed in single- and multi-speaker settings, for both informed and blind inpainting configurations (i.e., with the position of the mask known or unknown, respectively), using different objective metrics and a perceptual evaluation. Results show that while both solutions can correctly reconstruct signal portions of up to 200 ms (and even 400 ms in some cases), fine-tuning the SSL encoder provides more accurate signal reconstruction in the single-speaker setting, whereas freezing it (and training the neural vocoder instead) is the better strategy when dealing with multi-speaker data.
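The two training strategies above amount to selecting which module's parameters are passed to the optimizer. A minimal sketch in plain Python (all class and parameter names are illustrative stand-ins, not the actual HuBERT or HiFiGAN implementations):

```python
# Toy stand-ins for the SSL encoder (HuBERT) and the neural vocoder
# (HiFiGAN). In a real pipeline these would be the pretrained networks;
# here each module just holds named parameters and a trainable flag.

class Module:
    def __init__(self, name, n_params):
        self.name = name
        self.params = {f"{name}.w{i}": 0.0 for i in range(n_params)}
        self.trainable = True

    def freeze(self):
        """Exclude this module's parameters from training."""
        self.trainable = False

def trainable_params(*modules):
    """Collect parameters of the non-frozen modules only,
    mimicking what would be handed to an optimizer."""
    out = {}
    for m in modules:
        if m.trainable:
            out.update(m.params)
    return out

encoder = Module("hubert", 3)   # SSL encoder
vocoder = Module("hifigan", 2)  # neural vocoder / decoder

# Strategy A: fine-tune the encoder, freeze the vocoder
# (better in the single-speaker setting).
vocoder.freeze()
strategy_a = trainable_params(encoder, vocoder)

# Strategy B: freeze the encoder, train the vocoder
# (better with multi-speaker data).
encoder.trainable, vocoder.trainable = False, True
strategy_b = trainable_params(encoder, vocoder)

print(sorted(strategy_a))  # only hubert.* parameters
print(sorted(strategy_b))  # only hifigan.* parameters
```

With an actual deep-learning framework, freezing would typically be done by disabling gradient computation for one module's parameters; the sketch only illustrates which parameter set each strategy updates.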