Protecting Personally Identifiable Information (PII) in text data is crucial for privacy, but current PII generalization methods struggle with uneven data distributions and limited context awareness. To address these issues, we propose two approaches: a feature-based method that uses machine learning to improve performance on structured inputs, and a novel context-aware framework that considers the broader context and the semantic relationships between the original text and generalized candidates. The context-aware approach uses Multilingual-BERT for text representation, applies functional transformations, and scores candidates with mean squared error. Experiments on the WikiReplace dataset demonstrate the effectiveness of both methods, with the context-aware approach outperforming the feature-based one across different scales. This work advances PII generalization techniques by highlighting the importance of feature selection, ensemble learning, and contextual information for better privacy protection in text anonymization.
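To make the candidate-scoring idea concrete, the following is a minimal sketch of ranking generalized candidates by mean squared error against the original text. It is not the framework evaluated in this work: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a Multilingual-BERT representation, the default `transform` is the identity rather than any particular functional transformation, and the convention that a lower error indicates a closer candidate is an assumption made here for illustration.

```python
import hashlib
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 16) -> Vector:
    """Hypothetical stand-in for a Multilingual-BERT sentence embedding.

    Hashes each lowercased token into one of `dim` buckets, then
    L2-normalizes the resulting count vector.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rank_candidates(
    original: str,
    candidates: Sequence[str],
    embed: Callable[[str], Vector] = toy_embed,
    transform: Callable[[Vector], Vector] = lambda v: v,
) -> List[Tuple[str, float]]:
    """Score each candidate against the original and sort by ascending MSE.

    Assumption: a lower error means the candidate stays closer to the
    original text in the (transformed) representation space.
    """
    target = transform(embed(original))
    scored = [(c, mse(target, transform(embed(c)))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    original = "Alice moved to Paris in 2019"
    candidates = [
        "Alice moved to a European city in 2019",
        "Alice moved to a place at some time",
    ]
    for text, score in rank_candidates(original, candidates):
        print(f"{score:.4f}  {text}")
```

In practice, `embed` would be replaced by a call to a real encoder and `transform` by whatever learned mapping is applied before scoring; both are passed in as parameters so that the ranking logic stays independent of those choices.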