Despite advancements in speech recognition, accented speech remains challenging. While previous approaches have focused on modeling techniques or creating accented speech datasets, gathering sufficient data for the multitude of accents, particularly in the African context, remains impractical due to their sheer diversity and the associated budget constraints. To address these challenges, we propose AccentFold, a method that exploits spatial relationships between learned accent embeddings to improve downstream Automatic Speech Recognition (ASR). Our exploratory analysis of speech embeddings representing 100+ African accents reveals spatial accent relationships that highlight geographic and genealogical similarities and capture consistent phonological and morphological regularities, all learned empirically from speech. Furthermore, we discover accent relationships previously uncharacterized by the Ethnologue. Through empirical evaluation, we demonstrate the effectiveness of AccentFold by showing that, for out-of-distribution (OOD) accents, sampling accent subsets for training based on AccentFold information outperforms strong baselines, with a relative word error rate (WER) improvement of 4.6%. AccentFold presents a promising approach for improving ASR performance on accented speech, particularly in the context of African accents, where data scarcity and budget constraints pose significant challenges. Our findings emphasize the potential of leveraging linguistic relationships to improve zero-shot ASR adaptation to target accents.
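The subset-selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how accent embeddings are compared or how subsets are sampled, so everything below is an assumption made for illustration: each accent is represented by a single centroid vector, closeness is measured with cosine similarity, and the `k` closest accents are chosen as training accents for an unseen target accent. The function names, accent labels, and vectors are hypothetical toy values, not the authors' implementation or data. The helper `relative_wer_improvement` shows the standard arithmetic behind a relative WER figure.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_accents(target, centroids, k):
    """Return the k accents whose centroids are most similar to the target's.

    Assumption: `centroids` maps an accent label to one embedding vector.
    Ties are broken alphabetically so the result is deterministic.
    """
    target_vec = centroids[target]
    scored = [
        (cosine_similarity(target_vec, vec), name)
        for name, vec in centroids.items()
        if name != target
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


def relative_wer_improvement(baseline_wer, new_wer):
    """Relative improvement: the fraction of the baseline WER that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical 2-D centroids; real speech embeddings are much higher-dimensional.
toy_centroids = {
    "target": [1.0, 0.0],
    "near": [0.9, 0.1],
    "mid": [0.5, 0.5],
    "far": [0.0, 1.0],
}

selected = nearest_accents("target", toy_centroids, k=2)
```

With these toy vectors, `selected` contains the two accents closest to the target in embedding space, which would then supply the training data when no data exists for the target accent itself. Other distance measures or sampling schemes would fit the same interface.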