The biomedical field is among the sectors most affected by tightening Artificial Intelligence (AI) regulation and data protection legislation, given the sensitivity of patient information. However, the rise of synthetic data generation methods offers a promising opportunity for data-driven technologies. In this study, we propose a statistical approach to synthetic data generation for classification problems. We assess the utility and privacy implications of synthetic data generated by Kernel Density Estimator and K-Nearest Neighbors sampling (KDE-KNN) in a real-world setting: sepsis detection. Detecting sepsis is a critical challenge in clinical practice because of its rapid progression and potentially life-threatening consequences. We also highlight the benefits of KDE-KNN compared with current synthetic data generation methodologies, and we examine the effects of incorporating synthetic data into model training. This investigation provides insights into how effectively synthetic data generation techniques can mitigate regulatory constraints in the biomedical field.