Speech technology has improved greatly for norm speakers, i.e., adult native speakers of a language without speech impediments or strong accents. However, speech from non-norm, or diverse, speaker groups is recognised markedly less accurately than that of norm speakers; we refer to this performance gap as bias. In this work, we aim to reduce bias against different age groups and non-native speakers of Dutch. For an end-to-end (E2E) automatic speech recognition (ASR) system, we use state-of-the-art speed perturbation and spectral augmentation as data augmentation techniques, and we explore Vocal Tract Length Normalization (VTLN) to normalise for spectral differences that arise from differences in anatomy. The combination of data augmentation and VTLN reduced the average word error rate (WER) and the bias across various diverse speaker groups by 6.9% and 3.9%, respectively. The VTLN model trained on Dutch also improved performance on Mandarin Chinese child speech, showing that it generalises across languages.
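To make the idea behind VTLN concrete, the following is a minimal sketch of a piecewise-linear frequency warp, a common textbook form of VTLN. The abstract does not state which warping function or parameters were used, so the function name `vtln_warp`, the 8 kHz Nyquist frequency, and the 0.85 knee fraction are all illustrative assumptions rather than the authors' configuration.

```python
def vtln_warp(freq_hz, alpha, nyquist_hz=8000.0, knee_fraction=0.85):
    """Map a frequency to its warped position (piecewise-linear VTLN sketch).

    alpha is the per-speaker warp factor (hypothetical values around 0.8-1.2).
    Below the knee the axis is scaled by alpha; above it, a second linear
    segment brings the curve back to the Nyquist frequency so that the
    warped axis still spans [0, nyquist_hz].
    All defaults are assumptions made for illustration.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if not 0.0 <= freq_hz <= nyquist_hz:
        raise ValueError("freq_hz must lie in [0, nyquist_hz]")

    # Shrink the knee for alpha > 1 so that alpha * knee stays below Nyquist.
    knee = knee_fraction * nyquist_hz / max(alpha, 1.0)
    if freq_hz <= knee:
        return alpha * freq_hz

    # Segment joining (knee, alpha * knee) to (nyquist_hz, nyquist_hz).
    slope = (nyquist_hz - alpha * knee) / (nyquist_hz - knee)
    return alpha * knee + slope * (freq_hz - knee)


if __name__ == "__main__":
    # Compare how three hypothetical warp factors move a few frequencies.
    for alpha in (0.9, 1.0, 1.2):
        warped = [round(vtln_warp(f, alpha), 1) for f in (500.0, 3000.0, 7500.0)]
        print(alpha, warped)
```

In a typical pipeline a warp like this would be applied to the filterbank centre frequencies before feature extraction, with one warp factor estimated per speaker; how the factor is estimated is outside the scope of this sketch.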