Most state-of-the-art spoken language identification models are closed-set: they can only output a language label from the set of classes they were trained on. Open-set spoken language identification systems, by contrast, can detect when an input belongs to none of the languages seen in training. In this paper, we implement a novel approach to open-set spoken language identification that combines Mel-frequency cepstral coefficient (MFCC) and pitch features, a time-delay neural network (TDNN) that extracts feature embeddings, confidence thresholding on softmax outputs, and linear discriminant analysis (LDA) and probabilistic linear discriminant analysis (pLDA) for learning to classify new unknown languages. The resulting system achieves 91.76% accuracy on trained languages and can adapt to unknown languages on the fly. To train and evaluate the system, we also built the CU MultiLang Dataset, a large and diverse multilingual speech corpus.
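To make the confidence-thresholding step concrete, the following is a minimal sketch rather than the system's actual implementation. It assumes a classifier that produces one logit per trained language; the function names, the language labels, and the threshold value of 0.5 are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identify_language(logits, labels, threshold=0.5):
    """Return (label, confidence), or ("unknown", confidence) when the
    top softmax probability falls below the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return labels[best], probs[best]


# Hypothetical label set and scores, for illustration only.
labels = ["en", "zh", "es"]
confident = identify_language([5.0, 0.0, 0.0], labels)
uncertain = identify_language([0.1, 0.0, 0.05], labels)
```

In this sketch the first input is assigned to a trained language because one class dominates the softmax output, while the second is flagged as unknown because no class reaches the threshold. In an open-set pipeline, inputs flagged this way would be the ones passed on to a later stage that learns new languages.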