Large-scale pre-training corpora for large language models (LLMs) can contain sensitive personal information. Detecting and filtering such information is therefore essential for complying with privacy regulations and preventing unintended information leakage. However, compared with English and other languages, research on sensitive personal information in Japanese text remains limited. In this study, we focus on the category of sensitive personal data that Japan's Act on the Protection of Personal Information (APPI) defines as special care-required personal information (SCPI). We construct an SCPI dataset using LLM-based annotation and train machine learning models to rapidly detect SCPI in text. Our experiments show that the resulting SCPI classifier can effectively identify SCPI-related information. This study is the first to explore SCPI detection in Japanese text corpora, and it highlights the challenges of accurate detection.
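The detect-and-filter idea described above can be illustrated with a small sketch. The abstract does not specify the model architecture, features, or dataset format, so everything here is an assumption made for illustration: a naive Bayes classifier over character bigrams stands in for the trained models, and six hand-written toy sentences stand in for the LLM-annotated SCPI dataset. The names `NaiveBayesFilter` and `filter_corpus` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Character n-grams, so no Japanese word tokenizer is required."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesFilter:
    """Binary naive Bayes over character n-grams (label 1 = SCPI-related)."""

    def __init__(self, n=2):
        self.n = n
        self.counts = {0: Counter(), 1: Counter()}  # n-gram counts per class
        self.docs = {0: 0, 1: 0}                    # document counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def score(self, text):
        """Log-odds of the SCPI class; a positive value means SCPI-related."""
        v = len(self.vocab)
        totals = {c: sum(self.counts[c].values()) for c in (0, 1)}
        log_odds = math.log(self.docs[1] / self.docs[0])  # class prior ratio
        for g in char_ngrams(text, self.n):
            # Add-one (Laplace) smoothing keeps unseen n-grams finite.
            p1 = (self.counts[1][g] + 1) / (totals[1] + v)
            p0 = (self.counts[0][g] + 1) / (totals[0] + v)
            log_odds += math.log(p1 / p0)
        return log_odds

    def predict(self, text):
        return int(self.score(text) > 0)


def filter_corpus(documents, clf):
    """Keep only the documents the classifier does not flag as SCPI-related."""
    return [doc for doc in documents if clf.predict(doc) == 0]


# Toy training set (invented for illustration, not taken from the paper).
train_texts = [
    "彼は糖尿病の治療を受けている",    # medical history
    "彼女には窃盗の前科がある",        # criminal record
    "私はうつ病と診断された",          # medical history
    "今日は天気が良い",                # weather
    "新しいカフェが駅前に開店した",    # local news
    "明日の会議は十時に始まる",        # scheduling
]
train_labels = [1, 1, 1, 0, 0, 0]

clf = NaiveBayesFilter(n=2).fit(train_texts, train_labels)
kept = filter_corpus(["父は糖尿病と診断された", "明日は天気が良い"], clf)
```

Character n-grams are used here only because they avoid a dependency on a morphological analyzer; a realistic pipeline would need far more training data and a stronger model, and the difficulty of accurate detection noted in the abstract would still apply.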