We propose a method for simulating the human process of foreign accentuation using a Generative Spoken Language Model (GSLM) trained only on native speech corpora. When a listener hears spoken words of a foreign language and repeats them, the repeated speech often carries the accent of the listener's L1. This is said to be because the spoken words are mentally represented as a sequence of phonological units of the L1, and those units are then used for oral reproduction. We simulate this process by feeding speech of language A into a GSLM of language B, which adds B's accent to the input speech. Running ASR of the L1 on foreign input speech and passing the ASR result to TTS of the L1 can be viewed as a naive implementation of this approach. Our experiments show that the synthesized accent of the output speech is highly natural compared with real samples of A produced by speakers whose L1 is B, and that the degree of accentuation is controllable.
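The core mechanism can be illustrated with a toy sketch (not the paper's actual system; all names, dimensions, and the nearest-neighbor quantizer are illustrative assumptions): speech of language A is treated as a sequence of acoustic feature frames, and language B's "phonological units" are represented as centroids learned from native-B speech only. Forcing A's frames through B's discrete-unit bottleneck maps every foreign sound onto B's inventory, mimicking how a listener perceives foreign speech through L1 categories before reproducing it.

```python
import numpy as np

def quantize_to_l1_units(frames, l1_centroids):
    """Map each acoustic frame to the index of the nearest L1 unit.

    This is the perceptual step: foreign sounds are forced onto the
    listener's native (L1) unit inventory.
    """
    dists = np.linalg.norm(
        frames[:, None, :] - l1_centroids[None, :, :], axis=-1
    )
    return dists.argmin(axis=1)

def resynthesize(unit_ids, l1_centroids):
    """Toy 'resynthesis': emit the centroid frame for each unit.

    This is the reproduction step: speech is regenerated purely
    from L1 units, so the output inherits the L1 accent.
    """
    return l1_centroids[unit_ids]

rng = np.random.default_rng(0)
l1_centroids = rng.normal(size=(8, 4))     # language B's unit inventory (hypothetical)
foreign_frames = rng.normal(size=(20, 4))  # input speech of language A (hypothetical)

units = quantize_to_l1_units(foreign_frames, l1_centroids)
accented = resynthesize(units, l1_centroids)
# every output frame now lies exactly on one of B's units
```

In this sketch the "degree of accentuation" could be controlled by interpolating between the original frames and the quantized ones; the actual paper realizes the pipeline with a GSLM's discrete speech units rather than simple centroid quantization.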