Language Models (LMs) have been shown to leak information about their training data through sentence-level membership inference and reconstruction attacks. The risk of LMs leaking Personally Identifiable Information (PII) has received less attention, which can be attributed to the false assumption that dataset curation techniques such as scrubbing are sufficient to prevent PII leakage. Scrubbing reduces but does not eliminate this risk: in practice it is imperfect and must balance minimizing disclosure against preserving the utility of the dataset. Moreover, it is unclear to what extent algorithmic defenses such as differential privacy, which are designed to guarantee sentence- or user-level privacy, prevent PII disclosure. In this work, we propose (i) a taxonomy of PII leakage in LMs, (ii) metrics to quantify PII leakage, and (iii) attacks showing that PII leakage is a threat in practice. Our taxonomy provides rigorous game-based definitions of PII leakage via black-box extraction, inference, and reconstruction attacks that require only API access to an LM. We empirically evaluate the attacks against GPT-2 models fine-tuned on three domains: case law, health care, and e-mails. Our main contributions are (i) novel attacks that can extract up to 10 times more PII sequences than existing attacks, (ii) evidence that sentence-level differential privacy reduces the risk of PII disclosure but still leaks about 3% of PII sequences, and (iii) a subtle connection between record-level membership inference and PII reconstruction.
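As a rough illustration of how extraction leakage might be quantified, the sketch below computes the precision and recall of a set of extracted PII strings against the PII known to occur in the training data. The function name `extraction_precision_recall` and the toy strings are hypothetical and chosen only for illustration; the metric definitions used in this work may differ in detail.

```python
def extraction_precision_recall(extracted, training_pii):
    """Precision and recall of extracted PII candidates against training PII.

    Both arguments are iterables of strings; duplicates are ignored.
    Precision: fraction of extracted candidates that occur in the training PII.
    Recall: fraction of the training PII that was extracted.
    """
    extracted_set = set(extracted)
    training_set = set(training_pii)
    hits = extracted_set & training_set
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(extracted_set) if extracted_set else 0.0
    recall = len(hits) / len(training_set) if training_set else 0.0
    return precision, recall


# Hypothetical toy data, not taken from any real dataset.
training_pii = {"Alice Example", "Bob Sample", "Carol Placeholder", "Dan Mock"}
extracted = ["Alice Example", "Eve Invented", "Bob Sample"]

precision, recall = extraction_precision_recall(extracted, training_pii)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this reading, an attack that generates many candidates may reach high recall at the cost of precision, so reporting both numbers gives a fuller picture than either one alone.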