Modern language models (LMs) are trained on large scrapes of the Web containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses; it outperforms the best existing regex-based PI detectors. On a manually curated set of 483 PI instances, we measure memorization and find that 13.6% are parroted verbatim by the Pythia-6.9b model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We extend this analysis to models of varying sizes (160M-6.9B parameters) and pretraining time steps (70k-143k iterations) in the Pythia model suite and find that both model size and amount of pretraining correlate positively with memorization. Even the smallest model, Pythia-160m, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.
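The verbatim-parroting criterion described above can be sketched as follows. This is a minimal illustration of the decision rule only, not the paper's evaluation harness: the `toy_greedy_continue` stand-in and all names here are hypothetical, and a real run would instead call an actual model (e.g., Pythia via greedy generation) to produce the continuation.

```python
def parrots_verbatim(greedy_continue, prefix, pi_span):
    """Memorization criterion: prompt with the context that precedes the
    PI in its source document, greedy-decode a continuation, and count
    the instance as parroted iff the continuation begins with the
    entire PI span, character for character."""
    continuation = greedy_continue(prefix)
    return continuation.startswith(pi_span)


# Toy stand-in for an LM that has memorized a single training document
# (hypothetical; a real evaluation would query the model itself).
MEMORIZED_DOC = "For support, email alice@example.com today."

def toy_greedy_continue(prefix):
    """Deterministically 'continue' the one memorized document."""
    if MEMORIZED_DOC.startswith(prefix):
        return MEMORIZED_DOC[len(prefix):]
    return ""
```

With this stub, `parrots_verbatim(toy_greedy_continue, "For support, email ", "alice@example.com")` returns `True`, while an unseen prefix yields `False`; the exact-prefix match is what makes the criterion strict, since a single differing character means the span does not count as memorized.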