We examine text-free speech representations that a self-supervised learning (SSL) model extracts from raw audio, by analyzing speech synthesized from these SSL representations instead of conventional text representations. Because raw audio lacks the paired symbolic representation that transcribed speech has in its text, obtaining speech representations from unpaired speech is crucial for augmenting the datasets available for speech synthesis. Specifically, we synthesize speech from discrete symbol representations produced by the SSL model, compare it with speech synthesized from text representations, and analyze the synthesized speech in both cases. The results empirically show that text representations are advantageous for preserving semantic information, while discrete symbol representations are superior for preserving acoustic content, including prosodic and intonational information.
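The abstract does not state how the discrete symbols are derived from the SSL model. As a minimal sketch, assuming the common approach of assigning each frame-level feature vector to its nearest entry in a learned codebook, the step might look like the following. The names `quantize_frames` and `collapse_repeats`, the toy codebook, and the toy features are all hypothetical and are not taken from the paper; merging consecutive repeats is likewise an assumed, optional step.

```python
import numpy as np

def quantize_frames(features, codebook):
    """Map each frame vector to the index of its nearest codebook entry.

    features: (num_frames, dim) array standing in for SSL frame features.
    codebook: (num_units, dim) array standing in for learned cluster centroids.
    """
    # Squared Euclidean distance from every frame to every centroid,
    # giving an array of shape (num_frames, num_units).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def collapse_repeats(units):
    """Merge runs of identical consecutive units (assumed optional step)."""
    units = np.asarray(units)
    if units.size == 0:
        return units
    # Keep the first unit and every unit that differs from its predecessor.
    keep = np.concatenate(([True], units[1:] != units[:-1]))
    return units[keep]

# Toy data (hypothetical): 3 discrete units in a 2-dimensional feature space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
features = np.array([
    [0.1, -0.1],
    [0.0, 0.2],
    [0.9, 1.1],
    [3.8, 0.1],
    [4.1, -0.2],
    [1.2, 0.8],
])

units = quantize_frames(features, codebook)
symbols = collapse_repeats(units)
print(units.tolist())
print(symbols.tolist())
```

In a setup like this, the resulting symbol sequence would take the place of the text or phoneme sequence at the input of the synthesizer, which is what allows untranscribed audio to be used.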