Scaling While Privacy Preserving: A Comprehensive Synthetic Tabular Data Generation and Evaluation in Learning Analytics

Privacy is a significant obstacle to progress in learning analytics (LA): challenges such as inadequate anonymization and data misuse remain difficult for current solutions to address. Synthetic data emerges as a potential remedy that offers robust privacy protection. However, prior LA research on synthetic data lacks the thorough evaluation needed to assess the delicate balance between privacy and data utility: synthetic data must not only enhance privacy but also remain practical for data analytics. Moreover, different LA scenarios have different privacy and utility needs, which makes selecting an appropriate synthetic data approach a pressing challenge. To address these gaps, we propose a comprehensive evaluation of synthetic data that covers three dimensions of synthetic data quality: resemblance, utility, and privacy. We apply this evaluation to three distinct LA datasets, using three different synthetic data generation methods. Our results show that synthetic data can maintain utility (i.e., predictive performance) similar to that of real data while preserving privacy. Furthermore, because privacy and data utility requirements differ across LA scenarios, we make customized recommendations for synthetic data generation. This paper not only presents a comprehensive evaluation of synthetic data but also illustrates its potential to mitigate privacy concerns in LA, thereby contributing to wider application of synthetic data in LA and promoting better open science practice.
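
To make the three evaluation dimensions concrete, the following is a minimal sketch of how resemblance, utility, and privacy might each be scored for one synthetic dataset. It is not the evaluation used in this paper: the specific metrics (gap in column means, accuracy of a nearest-centroid classifier trained on synthetic data and tested on real data, and distance to the closest real record) are assumptions chosen for brevity, and `make_toy` is a hypothetical stand-in for both the real LA data and a generator's output.

```python
import math
import random
from statistics import mean


def column_means(rows):
    """Mean of each feature column."""
    return [mean(col) for col in zip(*rows)]


def resemblance_gap(real_X, synth_X):
    """Resemblance (assumed metric): largest absolute gap between column means."""
    return max(abs(r - s) for r, s in zip(column_means(real_X), column_means(synth_X)))


def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class label."""
    return {label: column_means([x for x, t in zip(X, y) if t == label])
            for label in set(y)}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


def accuracy(centroids, X, y):
    """Utility (assumed metric): share of correct predictions on held-out real data."""
    return mean(1.0 if predict(centroids, x) == t else 0.0 for x, t in zip(X, y))


def distance_to_closest_record(real_X, synth_X):
    """Privacy (assumed metric): smallest synthetic-to-real distance; 0 means a copied row."""
    return min(math.dist(s, r) for s in synth_X for r in real_X)


def make_toy(n, rng):
    """Hypothetical two-feature data: label 1 clusters near (2, 2), label 0 near (0, 0)."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append((rng.gauss(2.0 * label, 0.5), rng.gauss(2.0 * label, 0.5)))
        y.append(label)
    return X, y


rng = random.Random(0)
real_train_X, real_train_y = make_toy(200, rng)
real_test_X, real_test_y = make_toy(100, rng)
synth_X, synth_y = make_toy(200, rng)  # stand-in for a synthetic data generator's output

report = {
    "resemblance_gap": resemblance_gap(real_train_X, synth_X),
    # Baseline: train on real data, test on real data.
    "utility_real": accuracy(fit_centroids(real_train_X, real_train_y), real_test_X, real_test_y),
    # Train on synthetic data, test on the same real test set.
    "utility_synth": accuracy(fit_centroids(synth_X, synth_y), real_test_X, real_test_y),
    "privacy_dcr": distance_to_closest_record(real_train_X, synth_X),
}
print(report)
```

Comparing `utility_synth` against `utility_real` on the same held-out real test set shows how much predictive performance is lost by training on synthetic data, while the distance score flags synthetic rows that sit on top of real ones. In practice each dimension would likely be covered by several metrics rather than one.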
