State-of-the-art Automatic Speech Recognition (ASR) systems are known to perform unevenly across speech accents. To improve performance on a specific target accent, a commonly adopted solution is to finetune the ASR model using accent-specific labeled speech. However, acquiring large amounts of labeled speech for specific target accents is challenging. It therefore becomes important to choose an informative subset of speech samples that is most representative of the target accents for effective ASR finetuning. To address this problem, we propose DITTO (Data-efficient and faIr Targeted subseT selectiOn), which uses Submodular Mutual Information (SMI) functions as acquisition functions to find the most informative set of utterances matching a target accent within a fixed budget. An important feature of DITTO is that it supports fair targeting for multiple accents, i.e., it can automatically select representative data points from multiple accents when the ASR model needs to perform well on more than one accent. We show that DITTO is 3 to 5 times more label-efficient than other speech selection methods on the IndicTTS and L2 datasets.
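To make the selection step concrete, the following is a minimal sketch of budgeted greedy selection under a facility-location-style mutual information score, which is one member of the SMI family. It is not the authors' implementation: the abstract does not say which SMI function or which utterance features DITTO uses, so the score, the cosine similarity, the toy 2-D "embeddings", and the names `cosine_similarity`, `greedy_flqmi`, and `eta` are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def greedy_flqmi(pool, query, budget, eta=1.0):
    """Greedily pick up to `budget` pool indices under an assumed SMI-style score.

    score(A) = sum_q max_{a in A} s(q, a) + eta * sum_{a in A} max_q s(a, q)

    pool:  (n_pool, d) features of unlabeled utterances (hypothetical embeddings)
    query: (n_query, d) features of a few target-accent utterances
    """
    sim = np.clip(cosine_similarity(pool, query), 0.0, None)  # (n_pool, n_query)
    coverage = np.zeros(query.shape[0])   # best similarity to each query so far
    relevance = eta * sim.max(axis=1)     # per-item term, independent of A
    taken = np.zeros(pool.shape[0], dtype=bool)
    selected = []
    for _ in range(min(budget, pool.shape[0])):
        # Marginal gain of adding each candidate to the current selection.
        gains = np.maximum(sim, coverage).sum(axis=1) - coverage.sum() + relevance
        gains[taken] = -np.inf            # never pick the same utterance twice
        best = int(np.argmax(gains))
        selected.append(best)
        taken[best] = True
        coverage = np.maximum(coverage, sim[best])
    return selected


# Toy pool: indices 0, 1, 4 resemble "accent A"; indices 2, 3 resemble "accent B".
pool = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]])

single_target = greedy_flqmi(pool, np.array([[1.0, 0.0]]), budget=2)
multi_target = greedy_flqmi(pool, np.array([[1.0, 0.0], [0.0, 1.0]]), budget=2)
print(single_target, multi_target)
```

With a single-accent query, both picks come from the matching cluster. With one query example per accent, the coverage term rewards a selection that serves every query, so the two picks are split across the two clusters, which is the intuition behind fair targeting for multiple accents.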