The increasing demand for access to microdata in official statistics and data-intensive applications raises important challenges concerning disclosure risk, inferential validity and the preservation of statistical utility. This paper proposes an interpretable energy-driven framework for privacy-aware synthetic data generation for mixed-type data. The methodology combines discriminative modelling, Bayesian-network proposal mechanisms, Metropolis--Hastings sampling and post-generation optimization within a constrained probabilistic framework. Unlike perturbation-based approaches, it achieves privacy-aware behaviour through constrained stochastic exploration guided by explicit plausibility, privacy, diversity and structural-coherence penalties. The framework is designed specifically for mixed-type tabular data characterized by sparse configurations, heterogeneous variable types and complex multivariate dependency structures. Generation is formulated as a multi-objective sampling problem that balances statistical fidelity against disclosure risk while preserving predictive utility. An extensive empirical evaluation is conducted on a mixed-type individual-level dataset containing demographic, behavioural and health-related variables. The validation strategy combines statistical fidelity diagnostics, predictive analyses, diversity measures, nearest-neighbour risk analysis, membership inference attacks and split conformal prediction. The empirical results suggest that the framework preserves a substantial portion of the predictive and multivariate structure of the original data while limiting exact memorization and maintaining favourable privacy-aware behaviour. The methodology thus offers an interpretable approach to synthetic data generation under competing utility and privacy constraints.
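To make the idea of energy-guided Metropolis--Hastings sampling concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy two-variable record (a numeric `age` and a categorical `smoker`), hypothetical penalty forms and weights (`w_plaus`, `w_priv`, `tau`), and a simple symmetric proposal standing in for the Bayesian-network proposal mechanism described above. The target density is taken to be proportional to exp(-E(x)), so lower energy means a more desirable synthetic record.

```python
import math
import random

# Toy "real" records (age, smoker). Purely illustrative, not data from the paper.
REAL = [(23.0, "no"), (35.0, "yes"), (41.0, "no"), (58.0, "no"), (62.0, "yes")]
CATEGORIES = ["no", "yes"]


def distance(a, b):
    """Assumed mixed-type distance: scaled age gap plus a category mismatch."""
    return abs(a[0] - b[0]) / 10.0 + (0.0 if a[1] == b[1] else 1.0)


def energy(x, w_plaus=1.0, w_priv=5.0, tau=0.3):
    """Weighted sum of a plausibility penalty and a privacy penalty.

    The penalty forms and weights are assumptions made for this sketch.
    """
    age, smoker = x
    mean = sum(r[0] for r in REAL) / len(REAL)
    var = sum((r[0] - mean) ** 2 for r in REAL) / len(REAL)
    # Both categories occur in REAL, so the frequency is strictly positive.
    freq = sum(1 for r in REAL if r[1] == smoker) / len(REAL)
    # Plausibility: Gaussian-style penalty on age, negative log marginal on smoker.
    plaus = 0.5 * (age - mean) ** 2 / var - math.log(freq)
    # Privacy: penalize candidates closer than tau to any real record.
    d_min = min(distance(x, r) for r in REAL)
    priv = max(0.0, tau - d_min)
    return w_plaus * plaus + w_priv * priv


def propose(x, rng):
    """Symmetric proposal: jitter the age or redraw the category uniformly."""
    age, smoker = x
    if rng.random() < 0.5:
        return (age + rng.gauss(0.0, 5.0), smoker)
    return (age, rng.choice(CATEGORIES))


def sample(n_steps, seed=0):
    """Run a Metropolis--Hastings chain targeting a density ∝ exp(-energy)."""
    rng = random.Random(seed)
    x = (40.0, "no")
    e = energy(x)
    chain = []
    for _ in range(n_steps):
        cand = propose(x, rng)
        e_cand = energy(cand)
        # With a symmetric proposal the acceptance probability is
        # min(1, exp(-(E(cand) - E(x)))).
        if e_cand <= e or rng.random() < math.exp(e - e_cand):
            x, e = cand, e_cand
        chain.append(x)
    return chain


if __name__ == "__main__":
    synthetic = sample(2000)
    print(synthetic[-5:])
```

In this sketch an exact copy of a real record pays the full privacy penalty `w_priv * tau`, so the chain is discouraged from memorizing individual records while the plausibility term keeps it near the bulk of the data. A non-symmetric proposal, such as one drawn from a fitted Bayesian network, would additionally require the proposal-density ratio in the acceptance step.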