Designing effective automatic speech recognition (ASR) systems for code-switching (CS) often depends on the availability of transcribed CS resources. To address this data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness of the generated audio using an overlap-add approach. We investigate the impact of the generated data on speech recognition in two scenarios: using in-domain CS text, and a zero-shot approach with synthesized CS text. Empirical results show relative reductions of up to 34.4% in Mixed-Error Rate and 16.2% in Word-Error Rate for the in-domain and zero-shot scenarios, respectively. Lastly, we demonstrate that CS augmentation increases the model's inclination to code-switch and reduces its monolingual bias.
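As an illustration of the splicing idea, the following is a minimal sketch of joining audio segments with an overlap-add cross-fade. It is not the paper's implementation: the abstract does not specify the window shape, the overlap length, or how segments are selected and aligned, so the raised-cosine fade, the function name `splice_overlap_add`, and the `overlap` parameter are all assumptions made for this example.

```python
import numpy as np


def splice_overlap_add(segments, overlap):
    """Join 1-D audio segments, cross-fading `overlap` samples at each joint.

    Hypothetical helper: the fade shape and parameters are assumptions,
    not details taken from the paper.
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    segments = [np.asarray(s, dtype=np.float64) for s in segments]
    if not segments:
        return np.zeros(0)
    if any(len(s) < 2 * overlap for s in segments):
        raise ValueError("each segment must span at least 2 * overlap samples")

    # Complementary raised-cosine ramps: fade_in + fade_out == 1 at every
    # sample, so a constant signal passes through a joint unchanged.
    n = np.arange(overlap)
    fade_in = 0.5 * (1.0 - np.cos(np.pi * (n + 0.5) / max(overlap, 1)))
    fade_out = 1.0 - fade_in

    total = sum(len(s) for s in segments) - overlap * (len(segments) - 1)
    out = np.zeros(total)
    pos = 0
    for i, seg in enumerate(segments):
        seg = seg.copy()
        if overlap > 0:
            if i > 0:
                seg[:overlap] *= fade_in        # ramp up into this segment
            if i < len(segments) - 1:
                seg[-overlap:] *= fade_out      # ramp down out of this segment
        out[pos:pos + len(seg)] += seg          # overlap-add into the output
        pos += len(seg) - overlap               # next segment starts inside the tail
    return out
```

In a CS data-generation setting, `segments` would be waveform excerpts cut from monolingual recordings in the order dictated by a code-switched transcript. With `overlap=0` the function reduces to plain concatenation, which makes the effect of the cross-fade easy to compare.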