As Speech Language Models (SLMs) move from personal devices into shared, multi-user environments such as smart homes, a new challenge emerges: the model must distinguish between users in order to manage information flow appropriately. Without this capability, an SLM can reveal one user's confidential schedule to another, a failure of what we term interactional privacy. Generating speaker-aware responses is therefore essential for the safe deployment of SLMs. Current SLM benchmarks test dialogue ability but overlook speaker identity; multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses accordingly; and privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextually sensitive information (e.g., a user's private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. VoxPrivacy spans three tiers of increasing difficulty, from following direct secrecy commands to proactively protecting privacy. Our evaluation of nine SLMs on a 32-hour bilingual dataset reveals a widespread vulnerability: most open-source models perform near random chance (around 50% accuracy) on conditional privacy decisions, and even strong closed-source systems fall short on proactive privacy inference. We further validate these findings on Real-VoxPrivacy, a human-recorded subset, confirming that the failures observed on synthetic data persist in real speech. Finally, we demonstrate a viable path forward: fine-tuning on a new 4,000-hour training set improves privacy-preserving behavior while maintaining robustness. To support future work, we release the VoxPrivacy benchmark, the large-scale training set, and the fine-tuned model to foster the development of safer, more context-aware SLMs.