Para-/non-linguistic information in speech plays a pivotal role in shaping listeners' impressions. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We therefore developed a voice impression control method for zero-shot TTS that uses a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). Both objective and subjective evaluations demonstrate the method's effectiveness in impression control. Furthermore, generating this vector with a large language model makes it possible to produce a target impression from a natural language description of the desired impression, eliminating the need for manual optimization. Audio examples are available on our demo page (https://ntt-hilab-gensp.github.io/is2025voiceimpression/).
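To make the idea of a low-dimensional impression vector concrete, the following is a minimal sketch and not the authors' implementation. The abstract names only the dark-bright pair; the other pair names, the [-1, 1] intensity range, the neutral value of 0.0, and the helper `make_impression_vector` are all assumptions introduced for illustration.

```python
# Hypothetical impression pairs. Only "dark-bright" appears in the abstract;
# the remaining names are illustrative assumptions.
IMPRESSION_PAIRS = ("dark-bright", "weak-powerful", "cold-warm")


def make_impression_vector(intensities, pairs=IMPRESSION_PAIRS, low=-1.0, high=1.0):
    """Build a fixed-order vector with one intensity per impression pair.

    Assumed convention: `low` means fully the first pole (e.g. dark),
    `high` means fully the second pole (e.g. bright), 0.0 is neutral.
    Pairs that are not specified default to neutral.
    """
    unknown = set(intensities) - set(pairs)
    if unknown:
        raise KeyError(f"unknown impression pairs: {sorted(unknown)}")
    # Clip each value into the assumed valid range.
    return [min(high, max(low, float(intensities.get(p, 0.0)))) for p in pairs]


# A slightly bright, otherwise neutral voice.
vector = make_impression_vector({"dark-bright": 0.8})
print(vector)
```

In such a setup, the dictionary of intensities could be filled in by hand or parsed from a language model's response to a description of the desired impression, and the resulting vector would be passed to the TTS model as a conditioning input.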