This paper introduces GLOBE, a high-quality English corpus with worldwide accents, designed to address a limitation of current zero-shot speaker-adaptive Text-to-Speech (TTS) systems: they generalize poorly to speakers with accents. Unlike commonly used English corpora such as LibriTTS and VCTK, GLOBE includes utterances from 23,519 speakers covering 164 accents worldwide, along with detailed metadata for these speakers. Compared with its source corpus, Common Voice, GLOBE significantly improves speech quality through rigorous filtering and enhancement, and it also fills in all missing speaker metadata. The final curated GLOBE corpus contains 535 hours of speech at a 24 kHz sampling rate. Our benchmark results indicate that a speaker-adaptive TTS model trained on GLOBE synthesizes speech with better speaker similarity than, and naturalness comparable to, models trained on other popular corpora. We will release GLOBE publicly after acceptance. The GLOBE dataset is available at https://globecorpus.github.io/.