Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment


July 6, 2023


Aref Farhadipour, Hadi Veisi

From arXiv, 12 pages, 8 figures

Dysarthria is a disability that disturbs the human speech production system and reduces the quality and intelligibility of a person's speech. As a result, conventional speech processing systems cannot work properly on impaired speech. Dysarthria is usually accompanied by physical disabilities, so a system that can perform tasks in a smart home in response to voice commands would be a significant achievement. In this work, we introduce the gammatonegram as an effective method for representing audio files with discriminative details, and we use it as the input to a convolutional neural network (CNN). In other words, we convert each speech file into an image and propose an image recognition system to classify speech in different scenarios. The proposed CNN is built by transfer learning from a pre-trained AlexNet. We evaluate the proposed system on three tasks: speech recognition, speaker identification, and intelligibility assessment. On the UA dataset, the speech recognition system achieved 91.29% accuracy in speaker-dependent mode, the speaker identification system achieved 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieved 96.47% accuracy in two-class mode. Finally, we propose a multi-network speech recognition system that works fully automatically. It is arranged in a cascade with the two-class intelligibility assessment system, whose output selects which of the speech recognition networks to activate. This architecture achieves a word recognition rate (WRR) of 92.3%. The source code of this paper is available.
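To make the idea of turning a speech file into an image more concrete, the following is a minimal sketch of one common way to compute a gammatonegram with NumPy. It is not the authors' implementation. The abstract does not give the filterbank or framing parameters, so the 64 channels, 50 Hz lower edge, 25 ms window, 10 ms hop, fourth-order FIR approximation of the gammatone filter, and the function names `erb_space`, `gammatone_fir`, and `gammatonegram` are all assumptions chosen for illustration.

```python
import numpy as np

def erb_space(low_hz, high_hz, n):
    """n center frequencies equally spaced on the ERB-rate scale (ascending)."""
    c = 9.26449 * 24.7  # constant from the Glasberg-Moore ERB formula
    return np.exp(np.linspace(np.log(low_hz + c), np.log(high_hz + c), n)) - c

def gammatone_fir(fc, sr, duration=0.025, order=4):
    """Truncated gammatone impulse response with unit gain at fc (assumed design)."""
    t = np.arange(int(round(duration * sr))) / sr
    b = 1.019 * (24.7 + fc / 9.26449)  # bandwidth from the ERB at fc
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    gain = np.abs(np.sum(g * np.exp(-2j * np.pi * fc * t)))
    return g / gain

def gammatonegram(signal, sr, n_channels=64, win=0.025, hop=0.010,
                  f_min=50.0, f_max=None):
    """Return (log-energy image of shape [channels, frames], center frequencies)."""
    f_max = sr / 2 if f_max is None else f_max
    centers = erb_space(f_min, f_max, n_channels)
    win_n, hop_n = int(round(win * sr)), int(round(hop * sr))
    n_frames = 1 + (len(signal) - win_n) // hop_n
    image = np.empty((n_channels, n_frames))
    for k, fc in enumerate(centers):
        # Causal filtering: keep only the first len(signal) output samples.
        y = np.convolve(signal, gammatone_fir(fc, sr))[:len(signal)]
        # Mean energy in overlapping frames.
        frames = np.lib.stride_tricks.sliding_window_view(y ** 2, win_n)[::hop_n]
        image[k] = frames.mean(axis=1)
    return 10 * np.log10(image + 1e-10), centers

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr // 2) / sr
    tone = np.sin(2 * np.pi * 1000 * t)  # synthetic stand-in for a speech file
    image, centers = gammatonegram(tone, sr)
    print(image.shape, centers[np.argmax(image.mean(axis=1))])
```

The resulting array has one row per auditory channel and one column per frame. Before it could be fed to an image classifier such as a pre-trained AlexNet, it would still need to be rendered or resized to the network's input size, a step this sketch leaves out.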
