VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings

Source: arXiv. Accepted to IEEE ICASSP 2026 (51st International Conference on Acoustics, Speech, and Signal Processing). 5 pages, 1 figure, 3 tables. Project page: https://vcbsl.github.io/VoxMorph/

Morphing techniques generate artificial biometric samples that combine features from multiple individuals, allowing each contributor to be verified against a single enrolled template. While extensively studied in face recognition, this vulnerability remains largely unexplored in voice biometrics. Prior work on voice morphing is computationally expensive, non-scalable, and limited to acoustically similar identity pairs, constraining practical deployment. Moreover, existing sound-morphing methods target audio textures, music, or environmental sounds and are not transferable to voice identity manipulation. We propose VoxMorph, a zero-shot framework that produces high-fidelity voice morphs from as little as five seconds of audio per subject without model retraining. Our method disentangles vocal traits into prosody and timbre embeddings, enabling fine-grained interpolation of speaking style and identity. These embeddings are fused via Spherical Linear Interpolation (Slerp) and synthesized using an autoregressive language model coupled with a Conditional Flow Matching network. VoxMorph achieves state-of-the-art performance, delivering a 2.6x gain in audio quality, a 73% reduction in intelligibility errors, and a 67.8% morphing attack success rate on automated speaker verification systems under strict security thresholds. This work establishes a practical and scalable paradigm for voice morphing with significant implications for biometric security. The code and dataset are available on our project page: https://vcbsl.github.io/VoxMorph/
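The abstract names Spherical Linear Interpolation (Slerp) as the fusion step for the prosody and timbre embeddings. The snippet below is a minimal sketch of generic Slerp between two embedding vectors, not the authors' implementation. The function name `slerp`, the blend weight `alpha`, the choice to L2-normalize the inputs first, and the toy 2-D vectors are all assumptions made for illustration.

```python
import math


def slerp(a, b, alpha, eps=1e-8):
    """Spherical linear interpolation between two vectors.

    Assumption: inputs are L2-normalized first, so the result lies on the
    unit sphere. alpha=0 returns the direction of a, alpha=1 that of b.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    u = [x / norm_a for x in a]
    v = [x / norm_b for x in b]

    # Angle between the two unit vectors; clamp to guard against rounding.
    cos_omega = max(-1.0, min(1.0, sum(x * y for x, y in zip(u, v))))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)

    if sin_omega < eps:
        # Nearly parallel (or exactly opposite) vectors: the Slerp weights
        # are ill-defined, so fall back to plain linear weights.
        w_u, w_v = 1.0 - alpha, alpha
    else:
        w_u = math.sin((1.0 - alpha) * omega) / sin_omega
        w_v = math.sin(alpha * omega) / sin_omega

    return [w_u * x + w_v * y for x, y in zip(u, v)]


# Hypothetical toy embeddings for two speakers (real ones would be
# high-dimensional and produced by an encoder).
speaker_a = [1.0, 0.0]
speaker_b = [0.0, 1.0]
morph = slerp(speaker_a, speaker_b, 0.5)  # approximately [0.7071, 0.7071]
```

Unlike a straight average, which shrinks toward the origin when the two vectors point in different directions, Slerp moves along the arc between them and keeps the result at unit norm. Under the disentangled setup described in the abstract, the same function could in principle be applied to the prosody pair and the timbre pair separately, possibly with different `alpha` values.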
