Accurate recognition of personally identifiable information (PII) is central to automated text anonymization. This paper investigates the effectiveness of cross-domain model transfer, multi-domain data fusion, and sample-efficient learning for PII recognition. Using annotated corpora from the healthcare (I2B2), legal (TAB), and biography (Wikipedia) domains, we evaluate models along four dimensions: in-domain performance, cross-domain transferability, multi-domain fusion, and few-shot learning. Results show that legal-domain data transfers well to biographical texts, whereas the medical domain resists incoming transfer. The benefits of fusion are domain-specific, and in low-specialization domains high-quality recognition is achievable with only 10% of the training data.
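The four evaluation dimensions can be read as a grid of training and test configurations. The following is a minimal sketch of such a grid, not the authors' actual protocol. The names `experiment_grid` and `subsample` are hypothetical. It also assumes that fusion means pooling the training data of all three domains, and that the few-shot setting trains on a random 10% of the target domain's training data.

```python
import random
from itertools import permutations

# Domain labels taken from the abstract: I2B2 (medical), TAB (legal), Wikipedia (biography).
DOMAINS = ["medical", "legal", "biography"]


def subsample(examples, fraction, seed=0):
    """Return a reproducible random subset holding `fraction` of the examples (at least one)."""
    items = list(examples)
    k = max(1, round(len(items) * fraction))
    return random.Random(seed).sample(items, k)


def experiment_grid(domains, few_shot_fraction=0.10):
    """Enumerate (setting, train_domains, test_domain, train_fraction) tuples.

    Assumption: fusion pools every domain's training data, and few-shot
    keeps the test domain fixed while shrinking its training set.
    """
    grid = []
    for d in domains:
        grid.append(("in-domain", (d,), d, 1.0))
        grid.append(("few-shot", (d,), d, few_shot_fraction))
        grid.append(("fusion", tuple(domains), d, 1.0))
    # Cross-domain transfer: train on one domain, test on a different one.
    for src, tgt in permutations(domains, 2):
        grid.append(("cross-domain", (src,), tgt, 1.0))
    return grid


if __name__ == "__main__":
    for config in experiment_grid(DOMAINS):
        print(config)
```

With three domains this enumerates 15 configurations: three each for the in-domain, few-shot, and fusion settings, plus six ordered source-target pairs for cross-domain transfer.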