Protecting sensitive information is crucial in today's world of Large Language Models (LLMs) and data-driven services. One common way to preserve privacy is to apply data perturbation techniques, which reduce the overreaching utility of sensitive Personally Identifiable Information (PII) while maintaining its statistical and semantic properties. However, data perturbation methods often cause significant information loss, which makes them impractical to use. In this paper, we propose 'Life of PII', a novel Obfuscation Transformer framework that transforms PII into faux-PII while preserving the original information, intent, and context as much as possible. Our approach comprises an API for interfacing with the given document, a configuration-based obfuscator, and a model based on the Transformer architecture, which has shown high context preservation and strong performance in natural language processing tasks and LLMs. The Transformer-based model learns a mapping between the original PII and its transformed faux-PII representation, which we call "obfuscated" data. Our experiments demonstrate that Life of PII outperforms traditional data perturbation techniques in both utility preservation and privacy protection. We show that our approach effectively reduces utility loss while preserving the original information, offering greater flexibility in the trade-off between privacy protection and data utility. Our work provides a solution for protecting PII in a variety of real-world applications.
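To make the idea of a configuration-based obfuscator concrete, the following is a minimal sketch under stated assumptions, not the framework's actual implementation. It replaces the learned Transformer mapping with simple regular expressions and hash-derived substitutes, so that the same original value always maps to the same faux-PII value. The names `CONFIG`, `GENERATORS`, `fake_email`, `fake_phone`, and `obfuscate`, as well as the patterns themselves, are hypothetical and chosen only for illustration.

```python
import hashlib
import re

# Hypothetical configuration: which PII types to obfuscate and how to find them.
# The patterns are deliberately simple and are assumptions, not the paper's rules.
CONFIG = {
    "email": {"pattern": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "enabled": True},
    "phone": {"pattern": r"\b\d{3}-\d{3}-\d{4}\b", "enabled": True},
}


def _digest(value: str) -> str:
    """Return a stable hex digest so each original maps to one faux value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def fake_email(original: str) -> str:
    """Build a faux e-mail address that keeps the general shape of an e-mail."""
    return f"user{_digest(original)[:8]}@example.com"


def fake_phone(original: str) -> str:
    """Build a faux phone number in the same NNN-NNN-NNNN layout."""
    digits = f"{int(_digest(original), 16) % 10**7:07d}"
    return f"555-{digits[:3]}-{digits[3:]}"


# Maps each configured PII type to the function that generates its substitute.
GENERATORS = {"email": fake_email, "phone": fake_phone}


def obfuscate(text: str, config: dict = CONFIG) -> str:
    """Replace every enabled PII type in `text` with a consistent faux value."""
    for pii_type, rule in config.items():
        if not rule.get("enabled", False):
            continue
        make_fake = GENERATORS[pii_type]
        text = re.sub(
            rule["pattern"],
            lambda match, f=make_fake: f(match.group(0)),
            text,
        )
    return text


if __name__ == "__main__":
    print(obfuscate("Contact alice@corp.com or 212-555-0187."))
```

Because the substitutes are derived from a hash of the original value, repeated mentions of the same e-mail address or phone number stay consistent across a document, which is one simple way to retain context. A learned model, as proposed in the paper, would instead be expected to produce substitutes that also preserve semantic properties this rule-based stand-in ignores.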