Representation learning of speech without textual resources is an area of significant interest for many low-resource speech applications. In this paper, we describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework. The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers. The learned "time-frequency" representations from the convolutional neural network (CNN) module are further processed with long short-term memory (LSTM) layers, which generate a contextual vector representation for every windowed segment. The HUC framework, which allows the categorization of the representations into a small number of phoneme-like units, is used to train the model to learn semantically rich speech representations. The targets consist of phoneme-like pseudo labels for each audio segment, and these are generated with an iterative k-means algorithm. We explore techniques that improve the speaker invariance of the learned representations and illustrate the effectiveness of the proposed approach in two settings: (i) completely unsupervised speech applications on the sub-tasks described as part of the ZeroSpeech 2021 challenge, and (ii) semi-supervised automatic speech recognition (ASR) applications on the TIMIT dataset and on the GramVaani challenge Hindi dataset. In these experiments, we achieve state-of-the-art results for various ZeroSpeech tasks. Further, in the ASR experiments, the HUC representations are shown to improve significantly over other established benchmarks based on Wav2vec, HuBERT and Best-RQ.
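The pseudo-label generation step can be illustrated with a minimal sketch. It assumes that each windowed segment has already been mapped to a fixed-length vector (here, plain Python lists standing in for the LSTM outputs) and runs a basic k-means loop to assign every segment a discrete unit index. The function names, the parameter `num_units`, and the toy data are hypothetical and chosen for illustration only; this is not the authors' implementation, and the CNN and LSTM modules as well as the speaker-invariance techniques are omitted.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_labels(frames, centroids):
    """Give each frame the index of its nearest centroid (its pseudo label)."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
        for f in frames
    ]


def update_centroids(frames, labels, centroids):
    """Recompute each centroid as the mean of the frames assigned to it."""
    dim = len(frames[0])
    updated = []
    for k, old in enumerate(centroids):
        members = [f for f, lab in zip(frames, labels) if lab == k]
        if not members:
            # Assumption: an empty cluster simply keeps its previous centroid.
            updated.append(list(old))
            continue
        updated.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return updated


def kmeans_pseudo_labels(frames, num_units, num_iters=20, seed=0):
    """Cluster frame vectors into `num_units` groups and return (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from randomly chosen frames.
    centroids = [list(f) for f in rng.sample(frames, num_units)]
    labels = assign_labels(frames, centroids)
    for _ in range(num_iters):
        centroids = update_centroids(frames, labels, centroids)
        new_labels = assign_labels(frames, centroids)
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


if __name__ == "__main__":
    # Toy "contextual vectors": two well-separated groups of 2-D points.
    toy_frames = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                  [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
    toy_labels, _ = kmeans_pseudo_labels(toy_frames, num_units=2)
    print(toy_labels)
```

In a training loop of the kind the abstract describes, labels like these would serve as classification targets for the model, and the clustering would be repeated as the representations improve; how often and on which layer's outputs is not specified here.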