Automatic syllable stress detection is a crucial component of Computer-Assisted Language Learning (CALL) systems. Current stress detection models are typically trained on clean speech and may therefore not be robust in real-world scenarios where background noise is prevalent. Speech enhancement (SE) models, which are designed to remove noise from speech, could address this, but their effect on syllable stress patterns is not well studied. This study examines how SE models representing discriminative and generative modeling approaches affect syllable stress detection under noisy conditions. We apply these models to speech data at signal-to-noise ratios (SNRs) from 0 to 20 dB and evaluate how well they preserve stress patterns. We also compare different feature sets to determine which capture stress patterns most effectively in noise. To further understand the impact of SE models, we conduct a human perceptual study that compares the stress patterns perceived in SE-enhanced speech with those perceived in clean speech, showing how well these models preserve syllable stress as heard by listeners. Experiments are performed on non-native English speech from German and Italian speakers. The results show that stress detection performance is robust with the generative SE models when heuristic features are used, and that the observations from the perceptual study are consistent with the stress detection outcomes under all SE models.
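To make the noisy-condition setup concrete, the following is a minimal sketch of how a noise signal can be scaled and added to clean speech to reach a target SNR. It is an illustration under stated assumptions, not the procedure used in the study: the function names `mix_at_snr` and `measured_snr_db`, the synthetic sine-wave "speech", the Gaussian noise, and the 5 dB step between 0 and 20 dB are all hypothetical choices made for the example.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + scaled noise has the requested SNR (dB)."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """Recover the SNR of a mixture, given the clean reference signal."""
    speech_power = sum(s * s for s in speech) / len(speech)
    residual_power = sum((y - s) ** 2 for s, y in zip(speech, noisy)) / len(speech)
    return 10 * math.log10(speech_power / residual_power)


# Synthetic stand-ins (assumptions): a 220 Hz tone as "speech", white noise.
rng = random.Random(0)
sample_rate = 16000
speech = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

# Assumed SNR grid; the abstract states only the 0 to 20 dB range.
snr_levels = [0, 5, 10, 15, 20]
noisy_versions = {snr: mix_at_snr(speech, noise, snr) for snr in snr_levels}
```

In a pipeline of the kind the abstract describes, each noisy version would then be passed through an SE model before stress detection. Here power is computed over the whole signal, whereas practical setups often measure speech power only over active speech regions.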