Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often carry sensitive and confidential dataset-level properties -- such as patient demographics or disease prevalence -- that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classifiers) and generative models (e.g., GANs for image data), it remains unclear whether such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark covers a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word-frequency signals. Empirical evaluations across multiple pretrained LLMs demonstrate the effectiveness of our attacks, revealing a previously unrecognized vulnerability in LLMs.
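To make the shadow-model attack concrete, the following is a minimal illustrative sketch, not the paper's actual pipeline: an adversary fine-tunes (or otherwise obtains) shadow models on datasets whose property value is known, extracts a word-frequency feature from each model's generations, fits a simple decision rule, and applies it to the target model's outputs. All data, the keyword "flu", and the threshold rule are hypothetical placeholders.

```python
# Hypothetical sketch of a shadow-model property-inference attack that
# uses word-frequency signals. Real attacks would query fine-tuned LLMs;
# here, generations are stubbed with hand-written strings.

def word_freq_feature(generations, keyword):
    """Fraction of generated responses containing `keyword` as a word."""
    return sum(keyword in g.lower().split() for g in generations) / len(generations)

def fit_threshold(features, labels):
    """Midpoint threshold between the two shadow classes.

    Assumes the positive class (label 1) yields strictly higher
    keyword frequencies than the negative class (label 0).
    """
    pos = [f for f, y in zip(features, labels) if y == 1]
    neg = [f for f, y in zip(features, labels) if y == 0]
    return (max(neg) + min(pos)) / 2

# Stand-in generations from shadow models fine-tuned on datasets with a
# known property value (1 = high flu prevalence, 0 = low).
shadow = [
    (["patient has flu symptoms", "flu vaccine advised", "rest and hydration"], 1),
    (["flu and fever reported", "flu recovery tips", "mild headache"], 1),
    (["knee pain after running", "apply ice to the joint", "stretching helps"], 0),
    (["back pain management", "posture correction", "physical therapy"], 0),
]
feats = [word_freq_feature(gens, "flu") for gens, _ in shadow]
labels = [y for _, y in shadow]
threshold = fit_threshold(feats, labels)

# Classify the (hypothetical) target model from its sampled generations.
target_gens = ["flu outbreak guidance", "flu symptoms persist", "hydration tips"]
prediction = int(word_freq_feature(target_gens, "flu") > threshold)
```

A practical attack would replace the single keyword with a feature vector over many words and the threshold with a learned meta-classifier, but the structure (shadow models, frequency features, decision rule) is the same.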