Language identification (LID) is a key preprocessing step in automatic speech recognition (ASR): it determines which language is spoken in an audio sample. Current systems that process speech in multiple languages require users to explicitly select one or more languages before use. LID matters most in multilingual settings, where an ASR system that cannot determine the spoken language fails to recognize the speech. This study presents a convolutional recurrent neural network (CRNN) based LID model that operates on the Mel-frequency cepstral coefficient (MFCC) features of audio samples. We also replicate state-of-the-art approaches, specifically a convolutional neural network (CNN) and an attention-based convolutional recurrent neural network (CRNN with attention), and compare them with our CRNN-based approach. In comprehensive evaluations on thirteen Indian languages, our model achieved over 98% classification accuracy. For linguistically similar languages, its performance ranges from 97% to 100%. The proposed model extends readily to additional languages and is robust to noise, achieving 91.2% accuracy in a noisy setting on a European Language (EU) dataset.
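To make the CRNN-on-MFCC pipeline concrete, the following is a minimal, dependency-free sketch of one forward pass: a convolution over time, a recurrent layer, and a softmax over thirteen language classes. It is an illustration under stated assumptions, not the architecture evaluated in the study. The layer sizes, the plain tanh recurrence (a practical CRNN would more likely use an LSTM or GRU), the untrained random weights, and every function and parameter name (`conv1d_relu`, `rnn_last_state`, `crnn_forward`, `init_params`) are hypothetical. The input is random numbers shaped like MFCC frames, not real audio.

```python
import math
import random

def conv1d_relu(frames, kernels, bias):
    """Valid 1-D convolution over time followed by ReLU.
    frames: T x F list of MFCC vectors; kernels: C x K x F; bias: length C."""
    k = len(kernels[0])
    out = []
    for t in range(len(frames) - k + 1):
        row = []
        for c, kern in enumerate(kernels):
            s = bias[c]
            for dt in range(k):
                s += sum(w * x for w, x in zip(kern[dt], frames[t + dt]))
            row.append(max(0.0, s))
        out.append(row)
    return out

def rnn_last_state(seq, w_in, w_rec, bias):
    """Simple recurrence h_t = tanh(W_in x_t + W_rec h_{t-1} + b); returns the final h."""
    h = [0.0] * len(bias)
    for x in seq:
        h = [math.tanh(bias[j]
                       + sum(w * xi for w, xi in zip(w_in[j], x))
                       + sum(w * hi for w, hi in zip(w_rec[j], h)))
             for j in range(len(bias))]
    return h

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rng, rows, cols, scale=0.1):
    """rows x cols matrix of uniform random values in [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def init_params(rng, n_mfcc=13, n_channels=8, kernel=3, hidden=16, n_langs=13):
    """Untrained random weights; all sizes are assumptions except n_langs=13."""
    return {
        "conv_w": [rand_matrix(rng, kernel, n_mfcc) for _ in range(n_channels)],
        "conv_b": [0.0] * n_channels,
        "rnn_in": rand_matrix(rng, hidden, n_channels),
        "rnn_rec": rand_matrix(rng, hidden, hidden),
        "rnn_b": [0.0] * hidden,
        "out_w": rand_matrix(rng, n_langs, hidden),
        "out_b": [0.0] * n_langs,
    }

def crnn_forward(frames, params):
    """MFCC frames -> conv features -> recurrent summary -> language probabilities."""
    feats = conv1d_relu(frames, params["conv_w"], params["conv_b"])
    h = rnn_last_state(feats, params["rnn_in"], params["rnn_rec"], params["rnn_b"])
    logits = [b + sum(w * hi for w, hi in zip(row, h))
              for row, b in zip(params["out_w"], params["out_b"])]
    return softmax(logits)

rng = random.Random(0)
params = init_params(rng)
# Placeholder input: 50 frames of 13 random values standing in for MFCCs.
frames = rand_matrix(rng, 50, 13, scale=1.0)
probs = crnn_forward(frames, params)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(len(probs), round(sum(probs), 6), predicted)
```

Because the weights are random, the predicted class index is meaningless; the sketch only shows how a variable-length sequence of MFCC frames is reduced to a single probability vector over the candidate languages.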