This study explores fine-tuning multilingual ASR (Automatic Speech Recognition) models, specifically OpenAI's Whisper-Tiny, to improve performance on Japanese. While multilingual models like Whisper offer versatility, they often lack precision in specific languages. Conversely, monolingual models like ReazonSpeech excel in language-specific tasks but are less adaptable. Using Japanese-specific datasets, we fine-tuned Whisper-Tiny with both Low-Rank Adaptation (LoRA) and end-to-end (E2E) training to bridge this gap. Our results show that fine-tuning reduced Whisper-Tiny's Character Error Rate (CER) from 32.7 to 20.8 with LoRA and to 14.7 with E2E fine-tuning, surpassing Whisper-Base's CER of 20.2. However, challenges with domain-specific terms remain, highlighting the need for specialized datasets. These findings demonstrate that fine-tuning multilingual models can achieve strong language-specific performance while retaining their flexibility. This approach provides a scalable solution for improving ASR in resource-constrained environments and for languages with complex writing systems like Japanese.
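The CER metric reported above is the character-level edit distance (substitutions, insertions, and deletions) between the hypothesis and reference transcripts, divided by the reference length; it is the standard choice for Japanese, where word boundaries are not whitespace-delimited. A minimal stdlib sketch (illustrative only, not the paper's evaluation code):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance over characters,
    normalized by the number of characters in the reference."""
    hyp_len = len(hypothesis)
    # prev[j] holds the edit distance between the first i-1 reference
    # characters and the first j hypothesis characters.
    prev = list(range(hyp_len + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)

# One substituted character out of nine reference characters -> CER = 1/9.
print(round(cer("今日は良い天気です", "今日は悪い天気です"), 3))
```

Multiplied by 100, this yields CER values on the same scale as the 32.7, 20.8, and 14.7 figures quoted above.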