For many Automatic Speech Recognition (ASR) tasks, spectrogram audio features yield better results than Mel-Frequency Cepstral Coefficients (MFCC), but in practice they are hard to use because of the high dimensionality of the feature space. This paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. A model trained for a 40-dimensional embedding (300 ms) was then used to generate features for a corpus of spoken commands, the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to a model using MFCC features.
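The compression step described above can be sketched minimally as follows. This is not the paper's implementation: the mel-bin count, frame count, and the linear stand-in for the convolutional encoder are illustrative assumptions; only the 13-dimensional latent size and the VAE reparameterization and KL terms follow the abstract.

```python
import numpy as np

# Hypothetical sizes: a 25 ms spectrogram fragment compressed to a
# 13-dimensional embedding, as in the abstract. The mel-bin and frame
# counts here are illustrative assumptions, not the paper's values.
N_MELS, N_FRAMES, Z_DIM = 40, 5, 13

rng = np.random.default_rng(0)

# Stand-in for the convolutional encoder: a single linear projection
# producing the mean and log-variance of the latent Gaussian.
W_mu = rng.normal(0.0, 0.01, (N_MELS * N_FRAMES, Z_DIM))
W_logvar = rng.normal(0.0, 0.01, (N_MELS * N_FRAMES, Z_DIM))

def encode(fragment):
    """Map a spectrogram fragment to latent mean and log-variance."""
    x = fragment.reshape(-1)
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (the VAE reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_divergence(mu, logvar):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

fragment = rng.standard_normal((N_MELS, N_FRAMES))  # fake 25 ms fragment
mu, logvar = encode(fragment)
z = reparameterize(mu, logvar)
print(z.shape)  # the 13-dimensional compressed representation
```

Training would add a decoder and minimize reconstruction error plus `kl_divergence`; the resulting `z` vectors are what replace MFCC features in the downstream ASR system.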