Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis. Here, we ask two fundamental questions about this strategy: when is synthetic data effective for personalization, and why is it effective in those cases? To address the first question, we adapt a state-of-the-art automatic speech recognition (ASR) model to target speakers from four benchmark datasets representative of different speaker types. We show that ASR personalization with synthetic data is effective in all cases, but particularly when (i) the target speaker is underrepresented in the global data, and (ii) the capacity of the global model is limited. To address the second question of why personalized synthetic data is effective, we use controllable speech synthesis to generate speech with varied styles and content. Surprisingly, we find that the text content of the synthetic data, rather than style, is important for speaker adaptation. These results lead us to propose a data selection strategy for ASR personalization based on speech content.