Self-supervised learning has been shown to benefit a wide range of speech processing tasks, such as speech recognition/translation, speaker verification, and diarization. However, most current approaches are computationally expensive. In this paper, we propose a simplified and more efficient self-supervised learning framework termed NeMo Encoder for Speech Tasks (NEST). Specifically, we adopt the FastConformer architecture with an 8x sub-sampling rate, which is faster than Transformer or Conformer architectures. Instead of clustering-based quantization, we use fixed random projection for its simplicity and effectiveness. We also implement a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that NEST improves over existing self-supervised models and achieves new state-of-the-art performance on a variety of speech processing tasks, such as speech recognition/translation, speaker diarization, and spoken language understanding. Code and checkpoints are publicly available via the NVIDIA NeMo framework.
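The fixed random-projection quantizer mentioned above replaces clustering-based targets (as in HuBERT-style pipelines) with a frozen random projection and a frozen random codebook: each frame is projected, then assigned the index of its nearest codebook entry, and those indices serve as prediction targets for masked frames. A minimal NumPy sketch of this idea follows; the dimensions (`feat_dim`, `proj_dim`, `codebook_size`) and the L2-normalization choice are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
feat_dim, proj_dim, codebook_size = 80, 16, 8192

# Both the projection matrix and the codebook are randomly
# initialized and then frozen: no training, no clustering.
projection = rng.standard_normal((feat_dim, proj_dim))
codebook = rng.standard_normal((codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)


def quantize(features: np.ndarray) -> np.ndarray:
    """Map frames of shape (T, feat_dim) to discrete target ids of shape (T,)."""
    projected = features @ projection  # (T, proj_dim)
    projected /= np.linalg.norm(projected, axis=1, keepdims=True)
    # Nearest codebook entry per frame becomes that frame's pseudo-label.
    dists = np.linalg.norm(projected[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)


frames = rng.standard_normal((100, feat_dim))
targets = quantize(frames)  # discrete targets for masked-prediction training
```

Because nothing in the quantizer is learned, the targets are cheap to compute and fully deterministic for a given random seed, which is what makes this approach simpler than iterative clustering-based quantization.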