Research involving privacy-sensitive data has long been constrained by data scarcity, in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents, such as OpenClaw and Gemini Agent, are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale, fully synthetic dataset built entirely from scratch: an expansive reservoir of texts rich in diverse private information, designed to broaden and accelerate research in areas where processing sensitive social data is unavoidable. Comprising 1.4 million records, Privasis offers orders-of-magnitude larger scale than existing datasets without sacrificing quality, and far greater diversity across document types, including medical histories, legal documents, financial records, calendars, and text messages, with a total of 55.1 million annotated attributes such as ethnicity, date of birth, and workplace. We leverage Privasis to construct a parallel corpus for text sanitization using a pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (<=4B parameters) trained on this corpus outperform state-of-the-art large language models such as GPT-5 and Qwen-3 235B. We will release our data, models, and code to accelerate future research on privacy-sensitive domains and agents.