In spoken language understanding, systems such as Whisper and Massively Multilingual Speech (MMS) have shown state-of-the-art performance. This study examines both systems, focusing on biases in automatic speech recognition (ASR) on casual conversational speech in Portuguese. Our investigation covers several categories: gender, age, skin tone, and geo-location. Alongside traditional ASR evaluation metrics such as Word Error Rate (WER), we apply statistical significance testing (p-values) to the gender bias analysis. We also examine the impact of data distribution and show empirically that oversampling techniques alleviate such stereotypical biases. This research is a pioneering effort to quantify ASR biases in Portuguese using MMS and Whisper, and it contributes to a better understanding of ASR system performance in multilingual settings.
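The gender comparison described in the abstract (per-group WER plus a p-value) can be illustrated with a minimal sketch. The abstract does not say which significance test was used, so the permutation test below is an assumption chosen for illustration, not the authors' method. The helper names (`edit_distance`, `wer`, `permutation_p_value`) and the toy transcripts are hypothetical, and each group is assumed to contain at least one non-empty reference.

```python
import random


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(pairs):
    """Corpus-level WER: total word edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def permutation_p_value(group_a, group_b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the absolute WER gap between two groups.

    Illustrative assumption: the source does not specify its test.
    """
    rng = random.Random(seed)
    observed = abs(wer(group_a) - wer(group_b))
    pooled = group_a + group_b
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign utterances to the two groups
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(wer(a) - wer(b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Made-up (reference, hypothesis) pairs, not data from the study.
female = [("bom dia a todos", "bom dia a todos"),
          ("eu gosto de café", "eu gosto café")]
male = [("vamos para casa", "vamos pra casa"),
        ("ela mora em lisboa", "ela mora lisboa")]

print(f"WER female: {wer(female):.3f}")
print(f"WER male:   {wer(male):.3f}")
print(f"p-value:    {permutation_p_value(female, male):.3f}")
```

With only a handful of utterances per group the p-value carries little information; the sketch is meant to show the shape of the comparison, where a small p-value on a realistically sized test set would indicate that the WER gap between groups is unlikely to be due to chance.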