Self-supervised learning (SSL)-based speech models are extensively used for full-stack speech processing. However, improving SSL-based speech representations for content-related tasks using unlabeled speech has been observed to be challenging and computationally expensive. Recent work has addressed this issue with cost-effective self-supervised fine-tuning (SSFT) approaches. Continuing in this direction, a cost-effective SSFT method named "LASER: Learning by Aligning Self-supervised Representations" is presented. LASER is based on the soft-DTW (soft dynamic time warping) alignment loss with a temporal regularisation term. Experiments are conducted with HuBERT and WavLM models, which are evaluated on the SUPERB benchmark for two content-related tasks: automatic speech recognition (ASR) and phoneme recognition (PR). Relative improvements of 3.7% and 8.2% for HuBERT, and 4.1% and 11.7% for WavLM, are observed for the ASR and PR tasks respectively, with less than 3 hours of fine-tuning on a single GPU.
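For readers unfamiliar with soft-DTW, the following is a minimal sketch of the standard soft-DTW recursion between two feature sequences, written in plain Python for clarity. It is not the LASER implementation: the function names (`soft_min`, `sq_dist`, `soft_dtw`), the squared Euclidean frame cost, and the default `gamma` are assumptions made for illustration, and the temporal regularisation term is omitted because the abstract does not specify its form.

```python
import math

def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    m = min(finite)
    # Shift by the minimum so every exponent is <= 0 (avoids overflow).
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in finite))

def sq_dist(u, v):
    """Squared Euclidean distance between two frames (assumed cost function)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between sequences x and y (lists of feature vectors)."""
    n, m = len(x), len(y)
    # R[i][j] holds the soft alignment cost of x[:i] against y[:j].
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sq_dist(x[i - 1], y[j - 1])
            R[i][j] = cost + soft_min(
                (R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]), gamma
            )
    return R[n][m]

# Toy 1-D "representations": b repeats one frame of a, c changes one frame.
a = [[0.0], [1.0], [2.0]]
b = [[0.0], [1.0], [1.0], [2.0]]
c = [[0.0], [5.0], [2.0]]
print(soft_dtw(a, b, gamma=0.01))
print(soft_dtw(a, c, gamma=0.01))
```

As `gamma` approaches zero, the value approaches the classical DTW cost; larger values of `gamma` give a smoother function that is differentiable with respect to the inputs, which is what makes it usable as a training loss. In a fine-tuning setting, `x` and `y` would presumably be frame-level model representations of two related inputs, though the abstract does not state how the pairs are formed.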