Large language models (LLMs) are increasingly deployed as personal AI assistants with access to sensitive user data, making privacy a major challenge for their design and evaluation. Prior work focuses mainly on individual-level risks and overlooks \textbf{interdependent privacy (IDP)}, where one person's data may be revealed by others without that person's knowledge or consent. We address this gap by introducing \textbf{IDP-Bench}, the first LLM benchmark for IDP scenarios, grounded in the Contextual Integrity (CI) framework. Using two LLM judges, we evaluate eight open-source LLMs on their understanding of IDP scenarios across three levels of IDP reasoning. Results show strong co-ownership recognition (6/8 models exceed 90\%) but persistent weaknesses in identifying CI parameters (information attribute, primary subject) and IDP-specific parameters such as secondary subjects, where 7/8 models score below 74\%. Models also struggle to judge the appropriateness of sharing (5/8 score below 77\%). This ability improves with model scale, so smaller models perform worse, and prompt sensitivity remains high on IDP-specific questions. These findings highlight the need for more targeted study of IDP in LLM privacy research. Data and code are available \href{https://github.com/tisl-lab/Interdependent_Privacy_Bench}{here}.