Training-Free Private Synthesis with Validation: A New Frontier for Practical Educational Data Sharing

While secondary use of real-world data (RWD) in education offers substantial research opportunities, data sharing is often limited by privacy constraints. Differentially private synthetic data generation (DP-SDG) has emerged as a possible solution. However, educational RWD is fragmented across platforms and institutions and stored in different formats, so DP-SDG must be tailored to each dataset, requiring substantial engineering effort. In addition, such data are often small-sample and high-dimensional, which makes deep learning (DL)-based methods the common choice, yet these are difficult to implement without specialist expertise. In this setting, it is also hard to achieve practically useful downstream utility. As a result, despite its theoretical promise, DP-SDG remains far from a practical solution in education. To address this issue, we propose a more practical two-stage method: (1) training-free, LLM-based DP-SDG, which produces synthetic data for sharing, and (2) on-demand real-data validation, in which researchers submit code that is run remotely on the real data to validate their results. This simple method is designed for individual data custodians without extensive DP-SDG expertise. It can also be adapted to multi-shot synthesis, where data from different learner cohorts are synthesised regularly. We evaluate this method experimentally in both the one-shot and multi-shot synthesis settings using RWD collected over three years, and we conduct a case study with real researchers. Results show that LLM-based DP-SDG performs comparably to a DL-based baseline while greatly reducing engineering costs, and that non-DP validation causes measurable but moderate privacy leakage. However, researchers in the case study reported that, on average, only 36% of findings obtained on synthetic data were validated on real data. Overall, the paper provides a practical method for sharing educational RWD, while highlighting challenges in risk mitigation and epistemic precision.
