Automatic speech recognition (ASR) systems typically use handcrafted feature extraction pipelines. To avoid their inherent information loss and to achieve more consistent modeling from speech to transcribed text, neural raw waveform feature extractors (FEs) are an appealing approach. The wav2vec 2.0 model, which has recently gained wide popularity, also uses a convolutional FE that operates directly on the speech waveform. However, this FE has not yet been studied extensively in the literature. In this work, we study its capability to replace the standard feature extraction methods in a connectionist temporal classification (CTC) ASR model and compare it to an alternative neural FE. We show that both are competitive with traditional FEs on the LibriSpeech benchmark and analyze the effect of the individual components. Furthermore, we analyze the learned filters and show that the most important information for the ASR system is obtained by a set of bandpass filters.