This paper studies speech diarization and Automatic Speech Recognition (ASR). Speech diarization separates the individual speakers within an audio stream: using the ASR transcript, it segregates each speaker's utterances and groups them by their distinctive audio characteristics. ASR is the ability of a machine or program to identify spoken words and phrases and convert them into a machine-readable format, that is, to turn an unknown speech waveform into the corresponding written transcription. In our diarization approach, we use a Gaussian Mixture Model (GMM) to represent speech segments. The inter-cluster distance is computed from the GMM parameters, and a distance threshold serves as the stopping criterion. The speech signal is analyzed using synchronized algorithms that take the pitch frequency into account. Our primary objective is to develop a model that minimizes the Word Error Rate (WER) of the resulting transcription.
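The clustering step described above (segments modeled by Gaussian parameters, an inter-cluster distance, and a distance threshold as the stopping criterion) can be illustrated with a minimal sketch. The abstract does not specify the distance measure or the number of mixture components, so the choices here are assumptions made only for illustration: each cluster is modeled by a single diagonal Gaussian (a one-component GMM), the distance is the symmetric Kullback-Leibler divergence, and the names `fit_diag_gaussian`, `symmetric_kl`, and `cluster_segments` are hypothetical.

```python
import numpy as np


def fit_diag_gaussian(frames, var_floor=1e-3):
    """Fit one diagonal Gaussian (assumed stand-in for a full GMM) to feature frames."""
    mean = frames.mean(axis=0)
    var = frames.var(axis=0) + var_floor  # floor keeps every variance strictly positive
    return mean, var


def symmetric_kl(p, q):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two diagonal Gaussians."""
    (m1, v1), (m2, v2) = p, q
    diff_sq = (m1 - m2) ** 2
    # The log-determinant terms of the two KL directions cancel in the sum.
    return float(0.5 * np.sum(v1 / v2 + v2 / v1 - 2.0 + diff_sq * (1.0 / v1 + 1.0 / v2)))


def cluster_segments(segments, threshold):
    """Agglomerative clustering of segments; returns one cluster label per segment.

    The closest pair of clusters is merged repeatedly. Merging stops once the
    smallest inter-cluster distance exceeds `threshold`.
    """
    clusters = [
        {"frames": np.asarray(seg, dtype=float), "members": [i]}
        for i, seg in enumerate(segments)
    ]
    while len(clusters) > 1:
        models = [fit_diag_gaussian(c["frames"]) for c in clusters]
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = symmetric_kl(models[a], models[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:  # stopping criterion
            break
        merged = {
            "frames": np.vstack([clusters[a]["frames"], clusters[b]["frames"]]),
            "members": clusters[a]["members"] + clusters[b]["members"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    labels = [0] * len(segments)
    for label, cluster in enumerate(clusters):
        for i in cluster["members"]:
            labels[i] = label
    return labels
```

Each element of `segments` is assumed to be a 2-D array of per-frame feature vectors (for example, cepstral features), all with the same dimensionality. The threshold controls how many speakers are found: a lower value stops merging earlier and yields more clusters.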
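For reference, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer output, divided by the number of words in the reference. The sketch below computes it with a standard dynamic-programming edit distance. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration, not details taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

With the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", there is one substitution and one deletion against six reference words, so the function returns 2/6. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.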