The challenge of generating and evolving real-life like synthetic test data without accessing real-world raw data -- a Systematic Review

Background: High-level system testing of applications that take data from e-Government services as input requires test data that is real-life-like while guaranteeing the privacy of personal information. Domains with such strong requirements include information exchange between countries, medicine, and banking. This review aims to synthesize the current state of practice in this domain. Objectives: The objective of this systematic review is to identify existing approaches for creating and evolving synthetic test data without using real-life raw data. Methods: We followed well-known methodologies for conducting systematic literature reviews, including those of Kitchenham, as well as guidelines for analysing the limitations of our review and its threats to validity. Results: A variety of methods and tools exist for creating privacy-preserving test data. Our search found 1,013 publications in IEEE Xplore, ACM Digital Library, and SCOPUS. We extracted data from 75 of those publications and identified 37 approaches that partly answer our research question. A common prerequisite for using these methods and tools is direct access to real-life data, either for data anonymization or for synthetic test data generation. We identified nine existing synthetic test data generation approaches that came closest to answering our research question. Nevertheless, further work would be needed to add the ability to evolve synthetic test data to these approaches. Conclusions: None of the publications fully covered our requirements; each addressed them only partially. Synthetic test data evolution has not received much attention from researchers, but it needs to be explored for Digital Government Solutions, especially since new legal regulations are coming into force in many countries.
