Recent advances in Natural Language Processing have demonstrated the effectiveness of pretrained language models like BERT on a variety of downstream tasks. We present GiusBERTo, the first BERT-based model specialized for anonymizing personal data in Italian legal documents. GiusBERTo is trained on a large dataset of Court of Auditors decisions to recognize entities to anonymize, including names, dates, and locations, while retaining contextual relevance. We evaluate GiusBERTo on a held-out test set and achieve 97% token-level accuracy. GiusBERTo provides the Italian legal community with an accurate and tailored BERT model for de-identification, balancing privacy protection and data utility.
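The abstract reports evaluation by token-level accuracy. As a minimal sketch of how that metric is computed for a de-identification model (the BIO-style tag names below are illustrative assumptions, not taken from the paper):

```python
def token_level_accuracy(gold_tags, pred_tags):
    """Fraction of tokens whose predicted tag matches the gold tag.

    gold_tags, pred_tags: equal-length lists of per-token labels,
    e.g. BIO tags such as "B-PER" or "O" (illustrative tag scheme).
    """
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted sequences must align token by token")
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return correct / len(gold_tags)

# Illustrative example: one token of a person name is missed by the model.
gold = ["O", "B-PER", "I-PER", "O", "B-DATE", "O"]
pred = ["O", "B-PER", "O",     "O", "B-DATE", "O"]
print(token_level_accuracy(gold, pred))  # 5 of 6 tokens correct
```

Note that token-level accuracy counts the frequent "O" (non-entity) tokens, so it can look high even when some entity tokens are missed; entity-level precision and recall are common complementary metrics.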