Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles the synthesized utterances into speaker-aware simulated conversations. We evaluate five LLM families under single-generator, fixed-budget mixture, and scale-up settings, using the same FastConformer-Large training recipe throughout. All evaluations are run on the Hungarian BEA-Dialogue benchmark corpus, but the method itself applies to any language for which the resources required by each component are available. The results show that synthetic conversations consistently improve speech recognition performance, but that generator choice and data composition strongly affect the size of the gains. Our largest training configuration, which uses only 67 hours of real conversations and 636 hours of simulated data, performs better on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.