Prosody plays a crucial role in speech perception, influencing both human understanding and automatic speech recognition (ASR) systems. Despite its importance, prosodic stress remains under-studied because it is difficult to analyze efficiently. This study explores fine-tuning OpenAI's Whisper large-v2 ASR model to recognize phrasal, lexical, and contrastive stress in speech. Using a dataset of 66 native English speakers, including male, female, neurotypical, and neurodivergent individuals, we assess the model's ability to generalize stress patterns and to classify speakers by neurotype and gender from brief speech samples. Our results show near-human ASR accuracy across all three stress types and near-perfect precision in classifying gender and neurotype. By improving prosody-aware ASR, this work contributes to equitable and robust transcription technologies for diverse populations.