Self-supervised learning has been shown to benefit a wide range of speech processing tasks, such as speech recognition/translation, speaker verification, and diarization. However, most of these approaches are computationally intensive because they use Transformer encoders and lack sub-sampling. In this paper, we propose a new self-supervised learning model, termed Neural Encoder for Self-supervised Training (NEST). Specifically, we adopt the FastConformer architecture, which has an 8x sub-sampling rate and is faster than the Transformer or Conformer architectures. Instead of clustering-based token generation, we resort to fixed random projection for its simplicity and effectiveness. We also propose a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that the proposed NEST model improves over existing self-supervised models on a variety of speech processing tasks. Code and checkpoints will be made publicly available via the NVIDIA NeMo toolkit.