In this study, we have presented a novel approach to predicting the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture. Traditional methods for computing STOI typically require clean reference speech, which limits their applicability in real-world settings. To address this limitation, deep learning-based non-intrusive speech assessment models have garnered significant interest; many have achieved commendable performance, but room for improvement remains. We propose a bottleneck transformer that incorporates convolution blocks to learn frame-level features and a multi-head self-attention (MHSA) layer to aggregate that information across frames, enabling the model to focus on the key aspects of the input. Our model achieves higher correlation and lower mean squared error in both seen and unseen scenarios than the state-of-the-art model that uses self-supervised learning (SSL) and spectral features as inputs.
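To make the described design concrete, the following PyTorch sketch outlines one way such a predictor could be structured: convolution blocks produce frame-level features from a spectrogram, an MHSA layer aggregates information across frames, and a pooled regression head emits a STOI estimate in [0, 1]. This is a minimal illustration under stated assumptions; the class name, input dimensionality, layer sizes, and mean-pooling head are illustrative choices, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class BottleneckSTOIPredictor(nn.Module):
    """Illustrative sketch: conv blocks learn frame-level features,
    MHSA aggregates them, and a regression head predicts STOI.
    All hyperparameters below are assumptions for exposition."""

    def __init__(self, in_dim=257, d_model=128, n_heads=4):
        super().__init__()
        # Convolution blocks over the feature axis (in_dim could be
        # STFT magnitude bins or SSL feature dimensions; an assumption).
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Multi-head self-attention aggregates information across frames.
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Regression head maps the pooled utterance embedding to a score.
        self.head = nn.Linear(d_model, 1)

    def forward(self, spec):                 # spec: (batch, frames, in_dim)
        x = self.conv(spec.transpose(1, 2))  # -> (batch, d_model, frames)
        x = x.transpose(1, 2)                # -> (batch, frames, d_model)
        attn_out, _ = self.mhsa(x, x, x)     # self-attention over frames
        x = self.norm(x + attn_out)          # residual connection + norm
        pooled = x.mean(dim=1)               # average-pool over frames
        return torch.sigmoid(self.head(pooled)).squeeze(-1)  # STOI in [0, 1]

# Hypothetical usage: a batch of 8 utterances, 200 frames each.
spec = torch.randn(8, 200, 257)
scores = BottleneckSTOIPredictor()(spec)     # -> tensor of shape (8,)
```

The sigmoid output reflects that STOI is bounded in [0, 1]; in practice, the pooling strategy and loss (e.g., MSE against ground-truth STOI) would follow the training setup reported in the experiments.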