Recently, powerful Large Language Models (LLMs) have become easily accessible to hundreds of millions of users worldwide. However, their strong capabilities and vast world knowledge do not come without associated privacy risks. In this work, we focus on the emerging privacy threat LLMs pose -- the ability to accurately infer personal information from online texts. Despite the growing importance of LLM-based author profiling, research in this area has been hampered by a lack of suitable public datasets, largely due to the ethical and privacy concerns associated with real personal data. We take two steps to address this problem: (i) we construct a simulation framework for the popular social media platform Reddit using LLM agents seeded with synthetic personal profiles; and (ii) using this framework, we generate SynthPAI, a diverse synthetic dataset of over 7,800 comments manually labeled for personal attributes. We validate our dataset with a human study showing that humans barely outperform random guessing when distinguishing our synthetic comments from real ones. Further, we verify that our dataset enables meaningful personal attribute inference research by showing, across 18 state-of-the-art LLMs, that our synthetic comments allow us to draw the same conclusions as real-world data. Combined, our experimental results, dataset, and pipeline form a strong basis for future privacy-preserving research geared towards understanding and mitigating the inference-based privacy threats that LLMs pose.