We propose a novel approach for detecting personal data in structured datasets that leverages GPT-4o, a state-of-the-art large language model (LLM). A key innovation of our method is its use of contextual information: in addition to a feature's name and values, we supply the model with the names of the other features in the dataset and with the dataset description. We compare our approach to two alternative methods, Microsoft Presidio and CASSED, and evaluate all three on multiple datasets: DeSSI, a large synthetic dataset; datasets we collected from Kaggle and OpenML; and MIMIC-Demo-Ext, a real-world dataset containing patient information from critical care units. Our findings show that detection performance varies substantially with the dataset used for evaluation. CASSED excels on DeSSI, the dataset on which it was trained. On the medical dataset MIMIC-Demo-Ext, all models perform in a comparable range, although our GPT-4o-based approach clearly outperforms the others. Notably, personal data detection in the Kaggle and OpenML datasets appears to benefit from contextual information, as evidenced by the poor performance of CASSED and Presidio (neither of which uses the context of the dataset) compared to the strong results of our GPT-4o-based approach. We conclude that further progress in this field would benefit greatly from the availability of more real-world datasets containing personal information.
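To make the role of context concrete, the following is a minimal sketch of how a per-feature prompt could be assembled from the four inputs named above: the feature's name, its values, the other feature names, and the dataset description. The prompt wording, the `FeatureContext` container, and the helpers `build_prompt` and `parse_label` are hypothetical illustrations, not the prompt or code used in our experiments, and the call to the language model itself is deliberately left out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FeatureContext:
    """Inputs for classifying one feature (column). Hypothetical container."""
    name: str
    sample_values: Sequence[str]
    other_feature_names: Sequence[str]
    dataset_description: str


def build_prompt(ctx: FeatureContext, max_values: int = 5) -> str:
    """Assemble an illustrative prompt; the wording is an assumption."""
    # Only a few example values are included to keep the prompt short.
    values = ", ".join(repr(v) for v in ctx.sample_values[:max_values])
    others = ", ".join(ctx.other_feature_names) or "(none)"
    return (
        "Decide whether the following feature of a tabular dataset "
        "contains personal data.\n"
        f"Dataset description: {ctx.dataset_description}\n"
        f"Other feature names: {others}\n"
        f"Feature name: {ctx.name}\n"
        f"Example values: {values}\n"
        "Answer with exactly one word: 'personal' or 'non-personal'."
    )


def parse_label(response: str) -> bool:
    """Map a one-word model answer to True (personal) or False."""
    text = response.strip().strip("'\".").lower()
    return text.startswith("personal")
```

A method that ignores context would effectively see only the last two fields of the prompt; the dataset description and sibling feature names are what allow an ambiguous column such as `dob` or `id` to be interpreted correctly.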