Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models

Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codecs (NACs) provide superior speaker feature disentanglement and linguistic fidelity. NACs can also be combined with causal language models (LMs) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization, and they lack the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques. Our anonymization approach incorporates pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection strategies for LM conditioning, all of which leverage the disentanglement properties of quantized content codes to prevent speaker information leakage. Additionally, we compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements over the previous state-of-the-art streaming method, DarkStream, in intelligibility (up to 46% relative WER reduction) and emotion preservation (up to 28% relative UAR improvement), while maintaining comparable latency (180 ms vs. 200 ms) and comparable privacy protection against lazy-informed attackers. Against semi-informed attackers, however, it shows a 15% relative degradation.
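As an illustration of the pseudo-speaker sampling and speaker embedding mixing mentioned above, the following is a minimal sketch and not the system's actual implementation. It assumes a pool of speaker embeddings is available and forms a pseudo-speaker as a random convex combination of a few pool entries, followed by L2 normalization. The function names, the `num_mix` parameter, and the weighting scheme are all hypothetical choices made for this example.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sample_pseudo_speaker(pool, num_mix=4, rng=None):
    """Mix `num_mix` randomly chosen pool embeddings into one pseudo-speaker.

    Hypothetical helper: the pool, `num_mix`, and the random convex weights
    are illustrative assumptions, not values taken from the paper.
    """
    rng = rng or random.Random()
    # Pick distinct embeddings from the pool without replacement.
    chosen = rng.sample(pool, num_mix)
    # Draw positive random weights and normalize them so they sum to 1.
    raw = [rng.random() + 1e-6 for _ in chosen]
    total = sum(raw)
    weights = [w / total for w in raw]
    # Weighted average per dimension, then project back to unit length.
    dim = len(chosen[0])
    mixed = [sum(w * emb[d] for w, emb in zip(weights, chosen)) for d in range(dim)]
    return l2_normalize(mixed)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy pool of 10 random 8-dimensional vectors standing in for speaker embeddings.
    pool = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(10)]
    pseudo = sample_pseudo_speaker(pool, num_mix=4, rng=rng)
    print(len(pseudo))
```

Mixing several embeddings, rather than selecting a single one, means the resulting pseudo-speaker does not coincide with any individual speaker in the pool. How the real system selects and weights embeddings is not specified in this abstract.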
