Speech intelligibility can be degraded by multiple factors, such as noisy environments, technical difficulties or biological conditions. This work focuses on the development of an automatic, non-intrusive system for predicting the speech intelligibility level in the latter case. The main contribution of our research on this topic is the use of Long Short-Term Memory (LSTM) networks with log-mel spectrograms as input features for this purpose. In addition, this LSTM-based system is further enhanced by incorporating a simple attention mechanism that identifies the frames most relevant to the task. The proposed models are evaluated on the UA-Speech database, which contains dysarthric speech with different degrees of severity. Results show that the attention LSTM architecture outperforms both a reference Support Vector Machine (SVM)-based system with hand-crafted features and an LSTM-based system with mean pooling.
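The difference between mean pooling and attention pooling over frame-level LSTM outputs can be illustrated with a minimal sketch. The abstract does not specify the exact form of the attention mechanism, so the version below assumes a common formulation: each frame receives a scalar score from a dot product with a scoring vector, and a softmax over those scores yields the frame weights. The names `mean_pool`, `attention_pool`, `w` and the toy `hidden` values are hypothetical and used only for illustration.

```python
import math

def mean_pool(frames):
    """Average frame-level vectors into one utterance-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def attention_pool(frames, w):
    """Weighted average of frames, with softmax weights over per-frame scores.

    `w` is a hypothetical scoring vector (score_t = w . h_t); in a trained
    model it would be learned jointly with the LSTM. This is an assumed
    formulation, not necessarily the one used in the paper.
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in frames]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, frames)) for d in range(dim)]
    return pooled, alphas

# Toy stand-in for LSTM outputs: 3 frames, 2 dimensions each.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A zero scoring vector gives uniform weights, i.e. plain mean pooling.
uniform_pooled, uniform_alphas = attention_pool(hidden, [0.0, 0.0])

# A non-zero scoring vector shifts weight toward the higher-scoring frames.
focused_pooled, focused_alphas = attention_pool(hidden, [10.0, 0.0])
```

Under this formulation, mean pooling is the special case in which every frame receives the same weight, so the attention variant can only help if the scoring vector picks out frames that carry more information about intelligibility.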