Selecting training data that matches the application scenario is important for automatic speech recognition (ASR), but it is difficult to measure how well a training corpus matches a target. This study proposes an unsupervised target-aware data selection method based on speech corpora divergence (SCD), which measures the similarity between two speech corpora. We first use the self-supervised HuBERT model to discretize each speech corpus into label sequences and calculate the N-gram probability distribution over those labels. We then calculate the Kullback-Leibler divergence between the two N-gram distributions as the SCD. Finally, we choose the subset with the minimum SCD to the target corpus for annotation and training. Compared to previous data selection methods, SCD data selection can focus on more acoustic details and guarantees the diversity of the selected set. We evaluate our method on different accents from Common Voice. Experiments show that the proposed SCD data selection achieves a 14.8% relative improvement over random selection, which is comparable or even superior to the result of supervised selection.
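The SCD computation described above can be illustrated with a minimal sketch. It assumes the speech has already been discretized into integer label sequences (the toy lists below stand in for HuBERT cluster IDs and are purely illustrative). The function names, the additive smoothing constant, and the direction of the divergence (target relative to candidate) are assumptions made for this example, not details stated in the abstract.

```python
import math
from collections import Counter


def ngram_counts(sequences, n):
    """Count N-grams over a corpus given as a list of discrete label sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


def kl_divergence(p_counts, q_counts, smoothing=1e-3):
    """KL(P || Q) between two N-gram count tables.

    Additive smoothing over the union of observed N-grams is an assumption
    of this sketch; it keeps the divergence finite when an N-gram is
    missing from one corpus.
    """
    support = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(support)
    q_total = sum(q_counts.values()) + smoothing * len(support)
    kl = 0.0
    for gram in support:
        p = (p_counts.get(gram, 0) + smoothing) / p_total
        q = (q_counts.get(gram, 0) + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl


def scd(target_seqs, candidate_seqs, n=2):
    """Divergence of the target corpus from a candidate corpus (assumed direction)."""
    return kl_divergence(ngram_counts(target_seqs, n), ngram_counts(candidate_seqs, n))


# Toy label sequences standing in for discretized speech (hypothetical data).
target = [[1, 2, 3, 1, 2, 3]]
candidates = {
    "subset_a": [[1, 2, 3, 1, 2]],  # shares the target's label patterns
    "subset_b": [[4, 5, 6, 4, 5]],  # shares none of them
}

# Pick the candidate subset with the smallest SCD to the target.
best = min(candidates, key=lambda name: scd(target, candidates[name]))
print(best)
```

In this toy setup the subset whose label bigrams overlap with the target yields the smaller divergence and is the one selected. How candidate subsets are enumerated or searched in practice is outside the scope of this sketch.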