Automatic speech recognition (ASR) is a core area of computational linguistics concerned with technologies that let computers convert spoken language into text. The field draws on both linguistics and machine learning. ASR models learn to map speech audio to transcripts through supervised learning and, like other speech technologies, must cope with real, unrestricted text: text-to-speech systems take such text directly as input, whereas ASR systems depend on language models trained on large text corpora. High-quality transcribed data is essential for training these models. This research had two main components: developing a web application and designing a web interface for speech recognition. The web application, written in JavaScript on Node.js, manages large volumes of audio files and their transcriptions and supports collaborative human correction of ASR transcripts. It operates in real time using a client-server architecture. The web interface for speech recognition records 16 kHz mono audio on any device running the web application, performs voice activity detection (VAD), and sends the audio to the recognition engine. VAD detects the presence of human speech so that non-speech intervals need not be processed, which saves computation and, in VoIP applications, network bandwidth. The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states, including an implementation of a novel backpropagation method that uses prior statistics of node co-activations.
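To make the VAD step concrete, the following is a minimal sketch of one common approach, frame-level energy thresholding with a short hangover. The abstract does not say which VAD algorithm the web interface uses, so this is an illustration rather than the authors' method. The frame length, threshold, hangover count, and all function names are assumptions chosen for the example. Only the 16 kHz mono format comes from the text. Samples are assumed to be floats in [-1, 1].

```typescript
// Minimal energy-based VAD sketch. All names and thresholds are
// illustrative assumptions, not taken from the system described above.

const SAMPLE_RATE = 16000; // 16 kHz mono, as stated in the text
const FRAME_MS = 20;       // assumed frame length
const FRAME_SIZE = (SAMPLE_RATE * FRAME_MS) / 1000; // 320 samples per frame

/** Root-mean-square energy of one frame of samples in [-1, 1]. */
function rms(frame: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / Math.max(frame.length, 1));
}

/**
 * Returns one flag per complete frame: true if the frame is treated as speech.
 * A frame counts as speech when its RMS reaches `threshold`; the hangover
 * keeps the flag on for a few frames afterwards so word endings are not cut.
 */
function detectSpeech(
  samples: Float32Array,
  threshold = 0.02,   // assumed value; needs tuning per microphone
  hangoverFrames = 5, // assumed value; 5 frames = 100 ms
): boolean[] {
  const flags: boolean[] = [];
  let hangover = 0;
  for (let start = 0; start + FRAME_SIZE <= samples.length; start += FRAME_SIZE) {
    const frame = samples.subarray(start, start + FRAME_SIZE);
    if (rms(frame) >= threshold) {
      hangover = hangoverFrames;
      flags.push(true);
    } else if (hangover > 0) {
      hangover--;
      flags.push(true);
    } else {
      flags.push(false);
    }
  }
  return flags;
}

/** Concatenates only the frames flagged as speech, dropping the rest. */
function keepSpeech(samples: Float32Array, flags: boolean[]): Float32Array {
  const kept = flags.filter(Boolean).length;
  const out = new Float32Array(kept * FRAME_SIZE);
  let offset = 0;
  flags.forEach((isSpeech, i) => {
    if (isSpeech) {
      out.set(samples.subarray(i * FRAME_SIZE, (i + 1) * FRAME_SIZE), offset);
      offset += FRAME_SIZE;
    }
  });
  return out;
}
```

In a setup like the one described, a client could pass only the output of `keepSpeech` to the recognition engine, which is where the savings in computation and bandwidth would come from. A fixed energy threshold is fragile in noisy environments, so a production system would more likely use an adaptive threshold or a trained classifier.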