Self-supervised learning (SSL) has been dramatically successful in both monolingual and cross-lingual settings. However, because the two settings have generally been studied separately, little research has examined how effective a cross-lingual model is compared with a monolingual one. In this paper, we investigate this fundamental question empirically on Japanese automatic speech recognition (ASR) tasks. First, we compare the ASR performance of cross-lingual and monolingual models on tasks in two different languages while keeping the acoustic domain as identical as possible. Then, we examine how much unlabeled Japanese data is needed to achieve performance comparable to that of a cross-lingual model pre-trained on tens of thousands of hours of English and/or multilingual data. Finally, we extensively investigate the effectiveness of SSL in Japanese and demonstrate state-of-the-art performance on multiple ASR tasks. Since there is no comprehensive SSL study for Japanese, we hope this study will guide Japanese SSL research.