Large language models (LLMs) are rapidly being adopted for tasks such as drafting emails, summarizing meetings, and answering health questions. In these settings, users may need to share private information (e.g., contact details, health records). To evaluate LLMs' ability to identify and redact such information, prior work introduced real-life, scenario-based benchmarks (e.g., ConfAIde, PrivacyLens) and found that LLMs can leak private information in complex scenarios. However, these evaluations relied on proxy LLMs to judge the helpfulness and privacy-preservation quality of LLM responses, rather than directly measuring users' perceptions. To understand how users perceive the helpfulness and privacy-preservation quality of LLM responses in privacy-sensitive scenarios, we conducted a user study ($n=94$) using 90 PrivacyLens scenarios. We found that users showed low agreement with one another when evaluating identical LLM responses. In contrast, five proxy LLMs reached high agreement with each other, yet each correlated poorly with users' evaluations. These results indicate that proxy LLMs cannot accurately estimate users' wide-ranging perceptions of utility and privacy in privacy-sensitive scenarios. We discuss the need for more user-centered studies that measure LLMs' ability to help users while preserving privacy, and for better alignment between LLMs and users in estimating perceived privacy and utility.