We present new algorithms for two tasks on heterogeneous tabular datasets: clustering and synthetic data generation. Tabular datasets typically have columns of heterogeneous data types (numerical, ordinal, categorical), but their rows may also have hidden cluster structure: for example, they may be drawn from heterogeneous (geographical, socioeconomic, methodological) sources, so that the outcome variable they describe (such as the presence of a disease) may depend not only on the other variables but also on the cluster context. Moreover, sharing of biomedical data is often hindered by patient confidentiality laws, and there is current interest in algorithms that generate synthetic tabular data from real data, for example via deep learning. We demonstrate a novel EM-based clustering algorithm, MMM (``Madras Mixture Model''), that outperforms standard algorithms in determining clusters in synthetic heterogeneous data and recovers structure in real data. Building on this, we demonstrate a synthetic tabular data generation algorithm, MMMsynth, that pre-clusters the input data and generates synthetic data cluster by cluster, assuming cluster-specific data distributions for the input columns. We benchmark this algorithm by measuring the performance of standard ML algorithms trained on synthetic data and tested on real published datasets. Our synthetic data generation algorithm outperforms other tabular-data generators from the literature and approaches the performance of training purely with real data.
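The cluster-wise generation idea described above can be illustrated with a minimal sketch. This is not the MMM or MMMsynth implementation: it assumes cluster labels are already available from some clustering step, treats columns as independent within each cluster, and uses a Gaussian for numeric columns and empirical frequencies for categorical ones. All of these choices, and the function names `fit_cluster_models` and `sample_synthetic`, are illustrative assumptions that the abstract does not specify.

```python
import random
import statistics
from collections import Counter, defaultdict


def fit_cluster_models(rows, labels, numeric_cols):
    """Fit independent per-column distributions within each cluster.

    rows: list of dicts (column name -> value); labels: one cluster id per row.
    Numeric columns get a (mean, stdev) pair; other columns get category counts.
    """
    by_cluster = defaultdict(list)
    for row, label in zip(rows, labels):
        by_cluster[label].append(row)

    models = {}
    for label, members in by_cluster.items():
        columns = {}
        for col in members[0]:
            values = [m[col] for m in members]
            if col in numeric_cols:
                # pstdev is defined even for a single-row cluster (it is 0.0).
                columns[col] = ("normal", statistics.fmean(values),
                                statistics.pstdev(values))
            else:
                counts = Counter(values)
                columns[col] = ("categorical", list(counts), list(counts.values()))
        models[label] = (len(members), columns)
    return models


def sample_synthetic(models, n_rows, seed=0):
    """Draw rows by picking a cluster (weighted by size), then sampling each column."""
    rng = random.Random(seed)
    cluster_ids = list(models)
    sizes = [models[c][0] for c in cluster_ids]
    synthetic_rows = []
    for _ in range(n_rows):
        cluster = rng.choices(cluster_ids, weights=sizes)[0]
        row = {}
        for col, spec in models[cluster][1].items():
            if spec[0] == "normal":
                row[col] = rng.gauss(spec[1], spec[2])
            else:
                row[col] = rng.choices(spec[1], weights=spec[2])[0]
        synthetic_rows.append(row)
    return synthetic_rows


# Toy data with two hypothetical clusters (labels assumed given).
rows = [
    {"age": 30.0, "smoker": "no"},
    {"age": 34.0, "smoker": "no"},
    {"age": 61.0, "smoker": "yes"},
    {"age": 67.0, "smoker": "yes"},
]
labels = [0, 0, 1, 1]
models = fit_cluster_models(rows, labels, numeric_cols={"age"})
synthetic = sample_synthetic(models, n_rows=5)
```

In a benchmark of the kind the abstract describes, rows such as `synthetic` would be used to train a standard classifier, which would then be scored on held-out real data.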