Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses show no reliable accented-standard difference in original-clone distances across systems. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in an off-the-shelf speaker-embedding distance, and they motivate evaluating speaker identity preservation and accent preservation as separable dimensions.
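The embedding-based comparison described above can be sketched as follows. This is a minimal illustration, not the study's pipeline: it assumes speaker embeddings are already available as fixed-length vectors, uses cosine distance as the original-clone distance, and compares the accented and standard groups with a permutation test on the difference in mean distance. The embedding model, the distance metric, and the statistical test are all assumptions; the function names are hypothetical, and the demo data are synthetic placeholders rather than results from the study.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def original_clone_distances(originals, clones):
    """Distance between each original embedding and its paired clone embedding."""
    return np.array([cosine_distance(o, c) for o, c in zip(originals, clones)])


def permutation_test(x, y, n_perm=2000, seed=0):
    """Two-sided permutation test on mean(x) - mean(y).

    Assumption: this test stands in for whatever statistical model
    the study actually used, which the abstract does not specify.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        diff = shuffled[: len(x)].mean() - shuffled[len(x):].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Synthetic placeholder embeddings (NOT data from the study):
    # each clone is its original plus a small random perturbation.
    rng = np.random.default_rng(42)
    dim, n_speakers = 192, 20
    std_orig = rng.normal(size=(n_speakers, dim))
    acc_orig = rng.normal(size=(n_speakers, dim))
    std_clone = std_orig + 0.3 * rng.normal(size=(n_speakers, dim))
    acc_clone = acc_orig + 0.3 * rng.normal(size=(n_speakers, dim))

    d_std = original_clone_distances(std_orig, std_clone)
    d_acc = original_clone_distances(acc_orig, acc_clone)
    diff, p = permutation_test(d_acc, d_std)
    print(f"mean distance difference (accented - standard): {diff:.4f}, p = {p:.3f}")
```

A non-significant difference in this kind of analysis would only indicate that the chosen embedding distance does not separate the two groups. It would not rule out perceptual differences, which is why the listener ratings of similarity and intelligibility are analyzed separately.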