Research involving privacy-sensitive data has long been constrained by data scarcity, in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents, such as OpenClaw and Gemini Agent, are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale, fully synthetic dataset built entirely from scratch: an expansive reservoir of texts rich in diverse private information, designed to broaden and accelerate research in areas where processing sensitive social data is unavoidable. Comprising 1.4 million records, Privasis offers orders-of-magnitude larger scale than existing datasets without sacrificing quality, and far greater diversity across document types, including medical histories, legal documents, financial records, calendars, and text messages, with a total of 55.1 million annotated attributes such as ethnicity, date of birth, and workplace. We leverage Privasis to construct a parallel corpus for text sanitization using a pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (<=4B parameters) trained on this corpus outperform state-of-the-art large language models such as GPT-5 and Qwen-3 235B. We will release our data, models, and code to accelerate future research on privacy-sensitive domains and agents.