Binary Gaussian Copula Synthesis: an LLM-powered data augmentation framework for early dialysis prediction in chronic kidney disease

Only a small fraction of patients with chronic kidney disease (CKD) progress to dialysis, creating severe class imbalance that limits the performance of machine learning models for early dialysis prediction. This challenge is compounded by the binary structure of electronic health record (EHR) data, for which most existing augmentation methods were not designed. We propose Binary Gaussian Copula Synthesis (BGCS), a two-stage data augmentation method tailored to binary clinical data. BGCS first generates synthetic minority-class samples using a Gaussian copula framework that explicitly models pairwise dependencies among binary features, then applies a fine-tuned GPT-2 classifier to filter out clinically implausible samples before training. We evaluated BGCS on a real-world EHR dataset of 15,169 patients with CKD from West Virginia, collected between 2008 and 2022, benchmarking it against SMOTE, CTGAN, and a standard Gaussian copula across four machine learning classifiers over 25 independent runs. BGCS consistently outperformed all comparison methods, achieving the highest minority-class recall for 90-day dialysis prediction, with median values ranging from 0.78 to 0.87 across classifiers, and the strongest distributional fidelity to real data, with a mean p-value of 0.68 across features. The best-performing BGCS-augmented model was integrated into an interpretable, decision-tree-based clinical decision support system for dialysis risk stratification, with electrolyte imbalances, cardiovascular comorbidities, and renal monitoring indicators emerging as the most influential predictive features. These findings suggest that augmentation methods designed for the structural properties of binary EHR data can meaningfully improve early dialysis risk prediction and support the development of interpretable clinical decision support tools for CKD care.
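
The two-stage idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each binary feature is obtained by thresholding a latent standard normal variable, and it uses the Pearson correlation of the 0/1 data, lightly shrunk toward the identity, as a crude proxy for the latent correlation; the abstract does not say how BGCS estimates pairwise dependencies. The function names, the toy data, and the `score_fn` filter (a placeholder where a fine-tuned classifier would go) are all hypothetical. Only the Python standard library is used.

```python
import math
import random
from statistics import NormalDist


def fit_binary_copula(rows):
    """Estimate marginal rates and a proxy latent correlation from 0/1 rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    sd = [math.sqrt(sum((r[j] - mean[j]) ** 2 for r in rows) / n) for j in range(d)]
    # Marginal P(x_j = 1), clipped away from 0 and 1 so inv_cdf stays finite.
    p = [min(max(m, 1e-3), 1 - 1e-3) for m in mean]
    corr = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for i in range(d):
        for j in range(d):
            if i != j and sd[i] > 0 and sd[j] > 0:
                cov = sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / n
                # Assumption: Pearson correlation stands in for the latent
                # correlation; the 0.95 shrinkage keeps the matrix positive definite.
                corr[i][j] = 0.95 * cov / (sd[i] * sd[j])
    return p, corr


def cholesky(a):
    """Lower-triangular L with L @ L.T == a, for a positive definite matrix a."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_binary(p, corr, n_samples, seed=0):
    """Stage 1: draw correlated normals, then threshold each one to 0/1."""
    rng = random.Random(seed)
    low = cholesky(corr)
    d = len(p)
    # x_j = 1 when z_j < cut_j, so that P(x_j = 1) = p_j.
    cut = [NormalDist().inv_cdf(pj) for pj in p]
    samples = []
    for _ in range(n_samples):
        e = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z = [sum(low[i][k] * e[k] for k in range(i + 1)) for i in range(d)]
        samples.append([1 if z[j] < cut[j] else 0 for j in range(d)])
    return samples


def filter_samples(samples, score_fn, threshold=0.5):
    """Stage 2: keep samples whose plausibility score reaches the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]


# Toy minority-class records (hypothetical data, three binary features).
minority = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
p, corr = fit_binary_copula(minority)
synthetic = sample_binary(p, corr, n_samples=5000, seed=0)
# Arbitrary toy rule standing in for the classifier: reject all-zero records.
kept = filter_samples(synthetic, score_fn=lambda s: float(any(s)))
```

Thresholding at the inverse normal CDF of each marginal rate preserves the per-feature frequencies, while the latent correlation matrix carries the pairwise dependence. A more faithful estimator of the latent correlation (for example, a tetrachoric correlation) would replace the Pearson proxy used here.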
