Self-supervised speech models learn representations that capture both content and speaker information. Yet this entanglement creates problems: content tasks suffer from speaker bias, and privacy concerns arise when speaker identity leaks through supposedly anonymized representations. We present two contributions to address these challenges. First, we develop InterpTRQE-SptME (Timbre Residual Quantitative Evaluation Benchmark of Speech pre-training Models Encoding via Interpretability), a benchmark that directly measures residual speaker information in content embeddings using SHAP-based interpretability analysis. Unlike existing indirect metrics, our approach quantifies the exact proportion of speaker information remaining after disentanglement. Second, we propose InterpTF-SptME, which uses these interpretability insights to filter speaker information from embeddings. Testing on VCTK with seven models including HuBERT, WavLM, and ContentVec, we find that SHAP Noise filtering reduces speaker residuals from 18.05% to nearly zero while maintaining recognition accuracy (CTC loss increase under 1%). The method is model-agnostic and requires no retraining.
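To make the idea of attribution-guided filtering concrete, here is a minimal sketch on synthetic data. It is not the InterpTRQE-SptME or InterpTF-SptME implementation, whose details the abstract does not give. Everything in it is an assumption made for illustration: the toy embeddings, the ridge-regressed linear speaker probe, the closed-form SHAP values of a linear model under feature independence, the per-dimension importance score, and the choice to overwrite the top-k dimensions with matched Gaussian noise. All function names are hypothetical.

```python
import numpy as np

def fit_linear_probe(X, y, n_classes, ridge=1e-3):
    """Ridge-regress one-hot speaker labels on embeddings; returns (W, b)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Y = np.eye(n_classes)[y]
    Yc = Y - Y.mean(axis=0)
    W = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(X.shape[1]), Xc.T @ Yc)
    b = Y.mean(axis=0) - mu @ W
    return W, b  # W has shape (dim, n_classes)

def linear_shap(X, W):
    """SHAP values of a linear model, assuming independent features:
    phi[n, d, c] = W[d, c] * (X[n, d] - mean_d)."""
    return (X - X.mean(axis=0))[:, :, None] * W[None, :, :]

def dim_importance(X, W):
    """Mean |SHAP| per embedding dimension, normalised to sum to 1.
    Illustrative score only, not the benchmark's residual metric."""
    phi = np.abs(linear_shap(X, W)).mean(axis=(0, 2))
    return phi / phi.sum()

def noise_filter(X, importance, k, rng):
    """Overwrite the k most speaker-attributed dims with Gaussian noise
    that matches each dim's mean and standard deviation."""
    top = np.argsort(importance)[::-1][:k]
    Xf = X.copy()
    Xf[:, top] = rng.normal(X[:, top].mean(axis=0),
                            X[:, top].std(axis=0),
                            size=(X.shape[0], k))
    return Xf, top

def probe_accuracy(X, y, W, b):
    """Fraction of frames whose speaker the linear probe identifies."""
    return float(((X @ W + b).argmax(axis=1) == y).mean())

def demo(seed=0, n=2000, dim=16, n_speakers=4, n_leaky=4):
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_speakers, size=n)
    X = rng.normal(size=(n, dim))
    # Toy leak: speaker s shifts dimension s; the other dims are pure noise.
    X[:, :n_leaky] += (3.0 * np.eye(n_speakers, n_leaky))[y]

    W, b = fit_linear_probe(X, y, n_speakers)
    importance = dim_importance(X, W)
    Xf, top = noise_filter(X, importance, n_leaky, rng)
    return {
        "acc_before": probe_accuracy(X, y, W, b),
        "acc_after": probe_accuracy(Xf, y, W, b),
        "filtered_dims": [int(d) for d in top],
    }

if __name__ == "__main__":
    print(demo())
```

In this toy setting the probe's speaker accuracy should fall from well above chance to roughly chance level once the highly attributed dimensions are replaced. The sketch does not model the content side (for example, the CTC loss reported in the abstract), so it says nothing about how much recognition accuracy such a filter would preserve on real embeddings.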