End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker-privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states of SALM-Duplex and Moshi leak substantial speaker identity across all transformer layers. Layer-wise and turn-wise analyses reveal that SALM-Duplex leaks most strongly in early layers whereas Moshi leaks uniformly across depth, and that linkability rises sharply within the first few dialogue turns. We propose two streaming anonymization setups built on Stream-Voice-Anon: a waveform-level front-end (Anon-W2W) and a feature-domain replacement (Anon-W2F). Anon-W2F raises EER by more than 3.5x over the discrete-encoder baseline (from 11.2% to 41.0%), approaching the 50% random-chance ceiling, while Anon-W2W retains 78-93% of baseline sBERT across setups at sub-second response latency (FRL under 0.8 s).