Accessibility of research data is critical for advances in many research fields, but textual data often cannot be shared because of the personal and sensitive information it contains, e.g., names or political opinions. The General Data Protection Regulation (GDPR) suggests pseudonymization as a way to secure open access to research data, but we need to learn more about pseudonymization as an approach before adopting it for the manipulation of research data. This paper outlines a research agenda within pseudonymization, namely the need for studies into the effects of pseudonymization on unstructured data in relation to, e.g., readability and language assessment, and into the effectiveness of pseudonymization as a way of protecting writer identity, while also exploring different ways of developing context-sensitive algorithms for the detection, labelling and replacement of personal information in unstructured data. The recently granted project on pseudonymization, "Grandma Karl is 27 years old", addresses exactly those challenges.