To balance open access to justice with personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process cannot handle court judgments at scale while adhering to strict legal requirements. In addition, the legal definitions and categorizations of personal identifiers are vague and poorly suited to technical solutions. To address these challenges, we propose Thunder-DeID, a de-identification framework that aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.