Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content. Current machine-learning-based approaches require substantial computational resources, hindering real-time streaming applications. To address these concerns, we propose a streaming model that achieves speaker anonymization with low latency. The system is trained end-to-end in an autoencoder fashion using a lightweight content encoder that extracts HuBERT-like information, a pretrained speaker encoder that extracts speaker identity, and a variance encoder that injects pitch and energy information. These three disentangled representations are fed to a decoder that re-synthesizes the speech signal. We present evaluation results from two implementations of our system: a full model that achieves a latency of 230 ms, and a lite version (0.1x in size) that further reduces latency to 66 ms while maintaining state-of-the-art performance in naturalness, intelligibility, and privacy preservation.
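The three-branch encoder/decoder flow described above can be sketched as follows. This is a structural illustration only, not the authors' implementation: all dimensions, frame sizes, and function names are hypothetical, and random projections stand in for the trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
HOP = 320  # hypothetical frame hop: 20 ms at 16 kHz

def content_encoder(wav):
    # Lightweight encoder: frame-level, HuBERT-like content features
    # (hypothetical 256-dim per frame).
    n_frames = len(wav) // HOP
    return rng.standard_normal((n_frames, 256))

def speaker_encoder(wav):
    # Pretrained encoder: a single utterance-level identity embedding
    # (hypothetical 192-dim).
    return rng.standard_normal(192)

def variance_encoder(wav):
    # Per-frame pitch and energy values, injected alongside content.
    n_frames = len(wav) // HOP
    return rng.standard_normal((n_frames, 2))

def decoder(content, speaker, variance):
    # Re-synthesize speech: condition each content frame on the
    # speaker embedding and variance features, then project to samples.
    n_frames = content.shape[0]
    cond = np.concatenate(
        [content, np.tile(speaker, (n_frames, 1)), variance], axis=1
    )
    return cond @ rng.standard_normal((cond.shape[1], HOP))

wav = rng.standard_normal(16000)  # 1 s of audio at 16 kHz
out = decoder(content_encoder(wav),
              speaker_encoder(wav),  # anonymization would swap in a
                                     # pseudo-speaker embedding here
              variance_encoder(wav))
print(out.shape)  # (50, 320): 50 frames of 320 output samples each
```

Because the three representations are disentangled, anonymization reduces to replacing the speaker embedding at the decoder input while the content and variance streams pass through unchanged.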