This paper proposes two methodologies for constructing customized Common Voice datasets for low-resource languages such as Hindi. The first uses Bark, a transformer-based text-to-audio model developed by Suno, together with Meta's EnCodec and a pre-trained HuBERT model to improve Bark's performance. The second employs Retrieval-Based Voice Conversion (RVC) and uses the Ozen toolkit for data preparation. Both methodologies support the advancement of automatic speech recognition (ASR) technology and offer insight into the challenges of constructing customized Common Voice datasets for under-resourced languages. They also provide a pathway to high-quality, personalized voice generation for a range of applications.