Evaluating the Impact of Discriminative and Generative E2E Speech Enhancement Models on Syllable Stress Preservation

Automatic syllable stress detection is a crucial component of Computer-Assisted Language Learning (CALL) systems. Current stress detection models are typically trained on clean speech and may therefore not be robust in real-world scenarios where background noise is prevalent. Speech enhancement (SE) models, which are designed to remove noise from speech, could be employed to address this, but their effect on the preservation of syllable stress patterns is not well studied. This study examines how different SE models, representing discriminative and generative modeling approaches, affect syllable stress detection under noisy conditions. We assess these models by applying them to speech data at signal-to-noise ratios (SNRs) ranging from 0 to 20 dB and evaluating how well they maintain stress patterns. We also explore different feature sets to determine which are most effective at capturing stress patterns in the presence of noise. To further understand the impact of the SE models, we conduct a human perceptual study that compares the stress patterns perceived in SE-enhanced speech with those perceived in clean speech, providing insight into how well these models preserve syllable stress as heard by listeners. Experiments are performed on English speech from non-native speakers whose native languages are German and Italian. The results show that stress detection performance is robust with the generative SE models when heuristic features are used. The observations from the perceptual study are consistent with the stress detection outcomes under all SE models.
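The noisy conditions described above depend on mixing noise into clean speech at a chosen SNR. The following is a minimal sketch of that standard scaling step, not the authors' pipeline: the function names, the synthetic sine "speech", the Gaussian noise, and the 5 dB step between 0 and 20 dB are all assumptions made for illustration.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # SNR_dB = 10 * log10(P_speech / P_noise)  =>  P_noise = P_speech / 10^(SNR_dB / 10)
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(clean, noisy):
    """Recover the SNR of a mixture given the clean reference."""
    residual = [y - x for x, y in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(residual))


# Stand-in signals (assumed): one second of a 220 Hz tone at 16 kHz plus white noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# The abstract gives only the 0 to 20 dB range; the 5 dB step is an assumption.
snr_levels = [0, 5, 10, 15, 20]
noisy_versions = {snr: mix_at_snr(speech, noise, snr) for snr in snr_levels}
```

Each entry of `noisy_versions` could then be passed to an SE model and a stress detector; `measured_snr_db` is included only as a sanity check that the mixture lands on the requested SNR.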

