The emergence of open data portals calls for more attention to protecting sensitive data before datasets are published and exchanged. To do so effectively, we observe the need to refine and broaden our definitions of sensitive data, and argue that the sensitivity of data depends on its context. Following this definition, we introduce a contextual data sensitivity framework that builds on two core concepts: 1) type contextualization, which considers the type of the data values at hand within the overall context of the dataset or document to assess their true sensitivity, and 2) domain contextualization, which assesses the sensitivity of data values using domain-specific information external to the dataset, such as the geographic origin of a dataset. Experiments with language models show that: 1) type contextualization significantly reduces the number of false positives in type-based sensitive data detection and reaches a recall of 94%, compared to 63% with commercial tools, and 2) domain contextualization based on sensitivity rule retrieval effectively grounds sensitive data detection in relevant context for non-standard data domains. A case study with humanitarian data experts also illustrates that context-grounded explanations provide useful guidance in manual data auditing processes. We open-source the implementation of the mechanisms and the annotated datasets at https://github.com/trl-lab/sensitive-data-detection.
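To make the two concepts concrete, the following is a minimal illustrative sketch in Python. It is not the released implementation: the regular expression stands in for a language-model type detector, and the names `PERSON_HINTS`, `Rule`, `RULES`, `type_contextualize`, and `domain_contextualize`, as well as the example rule and tables, are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Toy stand-in for a type detector (assumption: the paper uses language models).
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

# Hypothetical column-name tokens suggesting that rows describe individuals.
PERSON_HINTS = {"patient", "respondent", "beneficiary", "person"}


@dataclass(frozen=True)
class Rule:
    """Hypothetical sensitivity rule that lives outside the dataset."""
    domain: str       # e.g. a label derived from the dataset's geographic origin
    column_hint: str  # keyword matched against column names
    reason: str       # explanation shown to a human auditor


# Hypothetical rule base; a real system would retrieve rules from domain guidance.
RULES = [
    Rule("conflict-zone", "village",
         "Locations of affected populations may enable targeting."),
]


def detect_type(values):
    """Type-only detection: 'email' if every value looks like an address."""
    return "email" if values and all(EMAIL.match(v) for v in values) else None


def type_contextualize(column, table):
    """Flag a typed column only if the table's context points to individuals."""
    if detect_type(table[column]) is None:
        return False
    tokens = {tok for name in table for tok in name.lower().split("_")}
    return bool(tokens & PERSON_HINTS)


def domain_contextualize(column, domain):
    """Return the reasons of all external rules that apply to this column."""
    return [r.reason for r in RULES
            if r.domain == domain and r.column_hint in column.lower()]


# Two tables with the same value type but different context.
org_table = {"organization": ["Acme"], "contact_email": ["info@example.org"]}
patient_table = {"patient_id": ["17"], "email": ["jane.doe@example.org"]}
```

In this sketch, a type-only detector would flag the email column in both tables, whereas `type_contextualize` flags it only in `patient_table`. Similarly, `domain_contextualize("village_name", "conflict-zone")` returns an explanation, while the same column under a domain with no matching rule returns an empty list.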