To preserve or not to preserve prosody is a central question in voice anonymization. Prosody conveys meaning and affect, yet is tightly coupled with speaker identity. Existing methods either discard prosody for privacy or operate at fixed design points, lacking a principled mechanism to control the utility-privacy trade-off. We propose DiffAnon, a diffusion-based anonymization method with classifier-free guidance (CFG) that provides explicit, continuous inference-time control over prosody preservation. DiffAnon refines acoustic detail over semantic embeddings of an RVQ codec, enabling smooth interpolation between anonymization strength and prosodic fidelity within a single model. To the best of our knowledge, it is the first voice anonymization framework to provide structured, interpolatable inference-time prosody control. Experiments demonstrate structured trade-off behavior: DiffAnon achieves strong utility while maintaining competitive privacy across controllable operating points.
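As an illustration of how a guidance weight can give continuous inference-time control, the following is a minimal sketch of the standard classifier-free guidance combination, in which two denoiser predictions are mixed linearly at each diffusion step. It is not DiffAnon's implementation: the abstract does not specify the guidance formulation, so the mapping of the two branches to "prosody condition dropped" and "prosody condition supplied", the function name `cfg_combine`, and the toy vectors are all assumptions made for illustration.

```python
from typing import List, Sequence


def cfg_combine(eps_uncond: Sequence[float],
                eps_cond: Sequence[float],
                w: float) -> List[float]:
    """Standard classifier-free guidance mix of two denoiser outputs.

    w = 0 returns the unconditional prediction, w = 1 the conditional one,
    and values in between interpolate linearly between the two.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy stand-ins for the two denoiser predictions at one diffusion step.
# Both vectors are hypothetical values, not model outputs.
eps_without_prosody = [0.0, 1.0, -2.0]  # assumed: prosody condition dropped
eps_with_prosody = [1.0, 3.0, 0.0]      # assumed: prosody condition supplied

# Sweep the guidance weight to move between the two operating points.
for w in (0.0, 0.5, 1.0):
    print(w, cfg_combine(eps_without_prosody, eps_with_prosody, w))
```

Under these assumptions, sweeping `w` at inference time moves a single model between a stronger-anonymization operating point and a higher-prosodic-fidelity one without retraining, which is the kind of interpolation the abstract describes.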