Self-Supervised Learning (SSL) has proven useful in a variety of speech tasks. However, SSL methods are generally very demanding in terms of data, memory, and computational resources. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) is an SSL method that has shown strong performance on Automatic Speech Recognition (ASR) while being simpler than other SSL methods such as wav2vec 2.0. Despite this performance, the original paper omits details such as the number of GPU/TPU hours used in pre-training, and there is no official, easy-to-use open-source implementation. Furthermore, BEST-RQ has not been evaluated on downstream tasks other than ASR and speech translation. In this work, we describe a re-implementation of a random-projection quantizer and perform a preliminary study comparing it to wav2vec 2.0 on four downstream tasks. We discuss the details of our implementation and how it differs from the original. We show that a random-projection quantizer can achieve downstream performance similar to that of wav2vec 2.0 while decreasing training time by more than a factor of two.
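As a rough illustration of the idea behind a random-projection quantizer, the following minimal sketch maps a feature frame to a discrete index using a projection matrix and a codebook that are both drawn at random once and then kept frozen. During pre-training, such indices at masked positions would serve as the classification targets for the encoder. This is not the implementation described in this work: the function names, the dimensions, the Xavier-style initialisation range, and the use of L2 normalisation before the nearest-neighbour search are assumptions made for illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def make_quantizer(feat_dim, code_dim, num_codes, seed=0):
    """Draw a frozen random projection matrix and a frozen random codebook.

    All names and sizes here are hypothetical; nothing is ever trained.
    """
    rng = random.Random(seed)
    # Xavier-style uniform range for the projection (an assumption).
    bound = math.sqrt(6.0 / (feat_dim + code_dim))
    projection = [[rng.uniform(-bound, bound) for _ in range(code_dim)]
                  for _ in range(feat_dim)]
    # Codebook entries are sampled from a standard normal, then normalised.
    codebook = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                for _ in range(num_codes)]
    return projection, codebook


def quantize(frame, projection, codebook):
    """Return the index of the codebook entry nearest to the projected frame."""
    code_dim = len(projection[0])
    projected = [sum(frame[i] * projection[i][j] for i in range(len(frame)))
                 for j in range(code_dim)]
    projected = l2_normalize(projected)
    # For unit vectors, the smallest L2 distance is the largest dot product.
    scores = [sum(p * c for p, c in zip(projected, code)) for code in codebook]
    return max(range(len(codebook)), key=scores.__getitem__)
```

Because neither the projection nor the codebook receives gradient updates, the quantizer adds no trainable parameters and no auxiliary loss to pre-training, which is one reason the approach is considered simpler than a learned quantizer.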