Open science is a fundamental pillar of scientific progress and collaboration, built on the principles of open data, open source and open access. However, the requirements for publishing and sharing open data are often difficult to meet while complying with strict data protection regulations. Consequently, researchers need proven methods that let them anonymize their data themselves, without first sharing it with third parties. To this end, this paper presents a Python library for the anonymization of sensitive tabular data. The library offers a wide range of anonymization methods that can be applied to a given dataset; the user specifies the set of identifiers, the quasi-identifiers, the generalization hierarchies and the allowed level of suppression, along with the sensitive attribute and the required level of anonymity. The library has been developed following best practices for continuous integration and development, including workflows that run unit and functional tests and measure code coverage.
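To make the inputs named above concrete, the following is a minimal, standard-library-only sketch of k-anonymity through generalization and suppression. It is not the API of the library presented in this paper: the function `k_anonymize`, its parameters, the toy records and the greedy choice of which quasi-identifier to generalize next are all assumptions made purely for illustration.

```python
from collections import Counter


def k_anonymize(records, identifiers, quasi_identifiers, hierarchies, k, max_suppression):
    """Hypothetical helper: generalize quasi-identifiers until at most a fraction
    `max_suppression` of the records must be suppressed to reach k-anonymity."""
    # Direct identifiers are removed outright.
    rows = [{c: v for c, v in r.items() if c not in identifiers} for r in records]
    levels = {qi: 0 for qi in quasi_identifiers}

    def key(row):
        return tuple(row[qi] for qi in quasi_identifiers)

    while True:
        # Apply the current generalization level of each quasi-identifier
        # to the original values.
        view = [
            {**r, **{qi: hierarchies[qi][levels[qi]](r[qi]) for qi in quasi_identifiers}}
            for r in rows
        ]
        sizes = Counter(key(r) for r in view)
        # Records in equivalence classes smaller than k would have to be suppressed.
        kept = [r for r in view if sizes[key(r)] >= k]
        if len(view) - len(kept) <= max_suppression * len(view):
            return kept, levels
        # Greedy step (an assumption of this sketch): raise the level of the
        # quasi-identifier that currently has the most distinct values.
        candidates = [qi for qi in quasi_identifiers if levels[qi] + 1 < len(hierarchies[qi])]
        if not candidates:
            raise ValueError("k-anonymity not reachable with the given hierarchies")
        widest = max(candidates, key=lambda qi: len({r[qi] for r in view}))
        levels[widest] += 1


# Toy data, invented for this example.
records = [
    {"name": "Ana", "age": 23, "zip": "39005", "disease": "flu"},
    {"name": "Luis", "age": 27, "zip": "39007", "disease": "asthma"},
    {"name": "Eva", "age": 31, "zip": "39012", "disease": "flu"},
    {"name": "Juan", "age": 36, "zip": "39015", "disease": "diabetes"},
    {"name": "Marta", "age": 62, "zip": "28001", "disease": "flu"},
]

# One generalization hierarchy per quasi-identifier; level 0 keeps the value as is.
hierarchies = {
    "age": [str, lambda a: f"{a // 10 * 10}-{a // 10 * 10 + 9}", lambda a: "*"],
    "zip": [lambda z: z, lambda z: z[:3] + "**", lambda z: "*"],
}

anonymized, levels = k_anonymize(
    records,
    identifiers=["name"],
    quasi_identifiers=["age", "zip"],
    hierarchies=hierarchies,
    k=2,
    max_suppression=0.25,
)

for row in anonymized:
    print(row)
```

In this sketch the sensitive attribute (`disease`) is left untouched, and only the identifier and the quasi-identifiers are transformed. Techniques that also constrain the sensitive attribute inside each equivalence class would need an additional check on top of the class-size test used here.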