To generate synthetic datasets, for example in domains such as healthcare, the literature proposes two main types of approaches: Probabilistic Graphical Models (PGMs) and Deep Learning models, such as Large Language Models (LLMs). While PGMs produce synthetic data that can be used for advanced analytics, they do not support complex schemas and datasets. LLMs, on the other hand, support complex schemas but produce skewed dataset distributions, which are less useful for advanced analytics. In this paper, we therefore present Amalgam, a hybrid LLM-PGM data synthesis algorithm that supports advanced analytics, realism, and tangible privacy properties. We show that Amalgam synthesizes data with an average $\chi^2$ $P$-value of 91% and scores 3.8/5 for realism on our proposed metric, where the state of the art scores 3.3 and real data scores 4.7.