In the big data era, integrating diverse data modalities poses significant challenges, particularly in complex fields such as healthcare. This paper introduces a new process model for multimodal Data Fusion for Data Mining (DF-DM), which integrates embeddings and the Cross-Industry Standard Process for Data Mining (CRISP-DM) with the existing Data Fusion Information Group (DFIG) model. Our model aims to reduce computational cost, complexity, and bias while improving efficiency and reliability. We also propose "disentangled dense fusion", a novel embedding fusion method designed to optimize mutual information and facilitate dense inter-modality feature interaction, thereby minimizing redundant information. We demonstrate the model's efficacy through three use cases: predicting diabetic retinopathy from retinal images and patient metadata; predicting domestic violence from satellite imagery, internet, and census data; and identifying clinical and demographic features from radiography images and clinical notes. The model achieved a macro F1 score of 0.92 in diabetic retinopathy prediction; an R-squared of 0.854 and an sMAPE of 24.868 in domestic violence prediction; and macro AUCs of 0.92 and 0.99 for disease prediction and sex classification, respectively, in radiological analysis. These results underscore the DF-DM model's potential to significantly improve multimodal data processing and to promote its adoption in diverse, resource-constrained settings.