This paper proposes a method for extracting a lightweight subset from a text-to-speech (TTS) corpus while preserving synthetic speech quality. In recent years, methods have been proposed for constructing large-scale TTS corpora by collecting diverse data from massive sources such as audiobooks and YouTube. Although these methods have attracted significant attention for enhancing the expressive capabilities of TTS systems, they often prioritize collecting vast amounts of data without considering practical constraints such as storage capacity and training computation time, which limit the amount of data that can actually be used. Consequently, data must be collected efficiently within these volume constraints. To address this, we propose a method for selecting a core subset~(known as a \textit{core-set}) from a TTS corpus on the basis of a \textit{diversity metric}, which measures how broad a range the subset covers. Experimental results demonstrate that the proposed method performs significantly better than the baseline phoneme-balanced data selection across languages and corpus sizes.
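To make the idea of diversity-driven subset selection concrete, the following is a minimal sketch, not the method proposed in this paper. The abstract does not specify the diversity metric, so the sketch substitutes a common core-set heuristic: farthest-point (k-center greedy) selection over per-utterance feature vectors. The names `greedy_diverse_subset` and `coverage_radius`, the Euclidean distance, and the toy two-dimensional features are all assumptions made for illustration.

```python
import math
import random


def greedy_diverse_subset(features, k, seed=0):
    """Pick k utterance indices by farthest-point (k-center greedy) selection.

    `features` is a list of equal-length numeric vectors, one per utterance
    (hypothetical embeddings; the paper's actual features are not given here).
    """
    n = len(features)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of utterances")

    # Start from a random utterance so the result does not depend on corpus order.
    first = random.Random(seed).randrange(n)
    selected = [first]

    # min_dist[i]: distance from utterance i to its nearest selected utterance.
    min_dist = [math.dist(f, features[first]) for f in features]

    while len(selected) < k:
        # Add the utterance that is currently farthest from the selected set.
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, f in enumerate(features):
            d = math.dist(f, features[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


def coverage_radius(features, subset):
    """Largest distance from any utterance to its nearest selected utterance.

    A smaller radius means the subset covers the corpus more broadly; this is
    an illustrative stand-in for a diversity metric, not the paper's metric.
    """
    return max(min(math.dist(f, features[j]) for j in subset) for f in features)


if __name__ == "__main__":
    # Toy corpus: two tight pairs of near-duplicates and one isolated point.
    toy = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1), (0.0, 10.0)]
    chosen = greedy_diverse_subset(toy, k=3)
    print(sorted(chosen), coverage_radius(toy, chosen))
```

On the toy data, the greedy rule picks one utterance from each group instead of spending the budget on near-duplicates, which is the behavior a diversity-based selection is meant to encourage under a fixed size constraint.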