Text-to-speech (TTS) systems that synthesize natural breath sounds are essential for human-like voice agents, but building them requires extensive manual annotation of breath positions in the training data. To address this, we propose a self-training method for a breath detection model that automatically locates breaths in speech. The method trains the model on a large speech corpus in two steps: 1) a rule-based approach annotates a limited set of breath sounds, and 2) these annotations are iteratively augmented through pseudo-labeling based on the model's predictions. The detection model uses Conformer blocks with down-/up-sampling layers, enabling accurate frame-wise breath detection. We investigate its effectiveness in multi-speaker TTS using text transcripts with the detected breath marks inserted. The results indicate that using the proposed model for breath detection and breath mark insertion yields more natural speech containing breaths than a baseline model.
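The two-step self-training procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a small logistic-regression classifier stands in for the Conformer-based detector, random 2-D points stand in for acoustic frame features, and the names `train_logreg`, `predict_proba`, `self_train`, the confidence `threshold`, and the number of `rounds` are all assumptions made for illustration. The seed labels play the role of the limited rule-based annotations.

```python
import numpy as np

def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit a tiny logistic-regression frame classifier (stand-in for the real detector)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_proba(model, X):
    """Return the per-frame probability of the 'breath' class."""
    w, b = model
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def self_train(X, seed_labels, rounds=3, threshold=0.9):
    """Iteratively extend seed labels with confident pseudo-labels.

    seed_labels: 1 = breath, 0 = non-breath, -1 = unlabeled.
    """
    labels = seed_labels.copy()
    for _ in range(rounds):
        known = labels >= 0
        model = train_logreg(X[known], labels[known])
        p = predict_proba(model, X)
        # Only frames that are still unlabeled may receive a pseudo-label,
        # so the original seed annotations are never overwritten.
        unlabeled = labels < 0
        labels[unlabeled & (p >= threshold)] = 1
        labels[unlabeled & (p <= 1.0 - threshold)] = 0
    known = labels >= 0
    return train_logreg(X[known], labels[known]), labels

# Toy data: 100 "non-breath" frames and 100 "breath" frames (hypothetical features).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])

# Pretend a rule-based annotator labeled only five frames of each class.
seed = np.full(200, -1)
seed[:5] = 0
seed[100:105] = 1

model, labels = self_train(X, seed)
print("labeled frames before:", int((seed >= 0).sum()), "after:", int((labels >= 0).sum()))
```

In this sketch, frames whose predicted probability falls between the two thresholds stay unlabeled, which is one common way to limit the noise that pseudo-labels introduce. The abstract does not state how the paper selects its pseudo-labels.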