Self-supervised learning (SSL) of speech has shown impressive results in speech-related tasks, particularly in automatic speech recognition (ASR). While most methods feed the outputs of intermediate layers of the SSL model to downstream tasks as real-valued features, discretized token sequences are a promising alternative: they require less storage and allow techniques from natural language processing to be applied. In this paper, we propose a new protocol for using discretized token sequences in ASR tasks, which applies de-duplication and sub-word modeling to enhance the input sequence. By shortening the sequence, the protocol also reduces computational cost. Our experiments on the LibriSpeech dataset demonstrate that the proposed protocol performs competitively with conventional ASR systems that use continuous input features, while reducing computational and storage costs.
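To make the two sequence-shortening steps concrete, the following is a minimal Python sketch, not the implementation used in the paper. It assumes that the discrete tokens are integer unit IDs, that de-duplication means collapsing runs of identical consecutive tokens, and that sub-word modeling is done with a toy byte-pair-encoding (BPE) procedure; the abstract does not name the sub-word algorithm, so BPE is an assumption here. The function names `deduplicate`, `merge_pair`, and `learn_bpe` and the example unit IDs are hypothetical.

```python
from collections import Counter
from itertools import groupby


def deduplicate(tokens):
    """Collapse each run of identical consecutive tokens into a single token."""
    return [tok for tok, _ in groupby(tokens)]


def merge_pair(seq, pair, new_symbol):
    """Replace non-overlapping occurrences of `pair` in `seq`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Toy BPE (an assumed choice of sub-word model): repeatedly merge
    the most frequent adjacent pair of symbols across all sequences."""
    # Represent every symbol as a tuple of unit IDs so merged symbols stay hashable.
    seqs = [[(t,) for t in s] for s in sequences]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for s in seqs:
            counts.update(zip(s, s[1:]))
        if not counts:
            break
        pair = max(counts, key=counts.get)
        merges.append(pair)
        seqs = [merge_pair(s, pair, pair[0] + pair[1]) for s in seqs]
    return merges, seqs


# Hypothetical frame-level unit IDs for one utterance.
units = [5, 5, 5, 12, 12, 7, 7, 7, 7, 5, 12, 12]
deduped = deduplicate(units)
merges, encoded = learn_bpe([deduped], num_merges=1)
print(len(units), len(deduped), len(encoded[0]))
```

In this toy example the sequence length falls at each step, which is the mechanism the abstract credits for the lower computational cost. A practical system would more likely train the sub-word model with an established tokenizer library on a full corpus of unit sequences.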