Unsupervised representation learning has recently helped automatic speech recognition (ASR) tackle tasks with limited labeled data. Hardware limitations and practical applications in turn raise the question of how to take advantage of large pre-trained models efficiently and reduce their complexity. In this work, we study a challenging low-resource conversational telephony speech corpus from the medical domain in Vietnamese and German. We show the benefits of using unsupervised techniques beyond simple fine-tuning of large pre-trained models, discuss how to adapt them to a practical telephony task, including bandwidth transfer, and investigate different data conditions for pre-training and fine-tuning. We outperform the project baselines by 22% relative using pre-training techniques. Further gains of 29% can be achieved by refinements of architecture and training, and 6% by adding 0.8 h of in-domain adaptation data.