We introduce SSL-GMMVC, an interpretable voice conversion method in self-supervised speech space. The method models paired source-target features with a Gaussian mixture model and performs conversion as a posterior-weighted sum of affine transforms. This yields locally linear transformations that adapt to heterogeneous feature-space structure while remaining analytically tractable. Through objective and subjective evaluations, we show that SSL-GMMVC improves speaker similarity with comparable intelligibility and naturalness, and that even a constrained covariance variant surpasses a deep learning baseline as the number of mixture components increases. Further analyses link component selection to phonetic structure and reveal interpretable scaling and rotation in the learned transforms. These findings highlight SSL-GMMVC as an effective, analyzable framework for voice conversion.
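To make the "posterior-weighted sum of affine transforms" concrete, the sketch below implements the standard joint-density GMM conversion rule, in which each component k contributes the affine map mu_y[k] + cov_yx[k] cov_xx[k]^{-1} (x - mu_x[k]), weighted by the posterior p(k | x). This is an illustrative assumption about the general form, not necessarily the exact formulation used in SSL-GMMVC; the function name `gmm_convert` and all parameter names are hypothetical, and the GMM parameters are taken as already fitted on paired source-target features.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Convert source frames with a joint-density GMM (illustrative sketch).

    x:       (T, D) source feature frames
    weights: (K,)   mixture weights
    mu_x:    (K, D) source means;  mu_y: (K, D) target means
    cov_xx:  (K, D, D) source covariances
    cov_yx:  (K, D, D) target-source cross-covariances
    Returns the converted frames (T, D) and the posteriors (T, K).
    """
    x = np.atleast_2d(x)
    T, D = x.shape
    K = len(weights)
    log_post = np.empty((T, K))
    y_per_comp = np.empty((K, T, D))

    for k in range(K):
        diff = x - mu_x[k]                               # (T, D)
        # Rows of `solved` are cov_xx[k]^{-1} (x_t - mu_x[k]).
        solved = np.linalg.solve(cov_xx[k], diff.T).T    # (T, D)
        _, logdet = np.linalg.slogdet(cov_xx[k])
        maha = np.sum(diff * solved, axis=1)             # squared Mahalanobis distance
        # Unnormalized log posterior: log weight + Gaussian log density of x.
        log_post[:, k] = np.log(weights[k]) - 0.5 * (
            maha + logdet + D * np.log(2.0 * np.pi)
        )
        # Component-wise affine transform of each frame.
        y_per_comp[k] = mu_y[k] + solved @ cov_yx[k].T

    # Normalize posteriors in a numerically stable way.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Posterior-weighted sum of the per-component affine outputs.
    y = np.einsum("tk,ktd->td", post, y_per_comp)
    return y, post
```

Because each frame's output is a convex combination of K affine maps, the returned posteriors indicate which local transform dominates for that frame, which is the kind of quantity an analysis of component selection would inspect.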