This research addresses the challenge of training an automatic speech recognition (ASR) model for a personalized voice with minimal data. Using just 14 minutes of custom audio from a YouTube video, we employ Retrieval-Based Voice Conversion (RVC) to create a custom Common Voice 16.0 corpus. A Cross-lingual Self-supervised Representations (XLSR) Wav2Vec2 model is then fine-tuned on this dataset. The resulting web-based GUI efficiently transcribes and translates input Hindi videos. By integrating XLSR Wav2Vec2 and mBART, the system aligns the translated text with the video timeline, providing an accessible solution for transcribing and translating multilingual video content for a personalized voice.
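
The step of aligning translated text with the video timeline can be illustrated with a minimal sketch. It assumes the ASR stage yields time-stamped segments and the translation stage supplies one translated string per segment. The abstract does not state the output format, so the SubRip (SRT) subtitle layout used here is an assumption, and the names `Segment`, `srt_timestamp`, and `to_srt` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One translated utterance with its position on the video timeline (hypothetical type)."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str     # translated text for this time span


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{timing}\n{seg.text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Illustrative values only; real timings would come from the ASR stage.
    demo = [Segment(0.0, 2.5, "Hello"), Segment(2.5, 5.0, "Welcome to the video")]
    print(to_srt(demo))
```

Keeping the timing data separate from the text means the same segments could carry either the Hindi transcript or its translation without changing the alignment logic.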