We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves intelligibility, naturalness, and speaker similarity competitive with methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that USCF features, which are training-efficient and timbre-disentangled, can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.
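To make the phrase "learning a universal speech-to-content mapping via least-squares optimization" concrete, the following is a minimal sketch of fitting a linear map by ordinary least squares on synthetic data. It is a generic illustration, not the USCF algorithm: the abstract does not specify how the content targets are obtained, so the names `fit_linear_map`, `features`, `targets`, and `true_map`, as well as the toy dimensions, are all assumptions made for this example.

```python
import numpy as np


def fit_linear_map(features, targets):
    """Return W minimizing ||features @ W - targets||_F^2 (ordinary least squares)."""
    W, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return W


rng = np.random.default_rng(0)
n_frames, feat_dim, rank = 500, 16, 4  # toy sizes, chosen arbitrarily

# Synthetic stand-ins (assumptions, not real speech data):
# `features` plays the role of frame-level speech features and
# `targets` the role of a low-rank content representation.
true_map = rng.standard_normal((feat_dim, rank))
features = rng.standard_normal((n_frames, feat_dim))
targets = features @ true_map

# Fit the linear map, then project the features into the low-rank space.
W = fit_linear_map(features, targets)
content = features @ W  # shape: (n_frames, rank)
```

Because the fit is a single closed-form solve, no iterative neural training is involved, which is consistent with the abstract's description of the method as simple and linear. How USCF derives its speaker-specific transformations from a few seconds of target speech is not detailed in the abstract and is not modeled here.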