This paper describes the systems developed by Tallinn University of Technology (TalTech) for the ASRU MADASR 2023 Challenge. The challenge focuses on automatic speech recognition for dialect-rich Indian languages with limited training audio and text data. TalTech participated in two tracks of the challenge: Track 1, which allowed only the provided training data, and Track 3, which allowed additional audio data. In both tracks, we relied on wav2vec2.0 models. Our methodology departs from the conventional procedure of finetuning pretrained wav2vec2.0 models in two key respects: first, we apply an aligned data augmentation technique to increase the linguistic diversity of the training data; second, we use deep prefix tuning to adapt the wav2vec2.0 models to dialects. In both tracks, our approach yielded significant improvements over the provided baselines and achieved the lowest word error rates among all participating teams.