mmid (Multi-Modal Integration and Downstream analyses for healthcare analytics) is a Python package that provides multi-modal fusion and imputation, classification, time-to-event prediction, and clustering under a single interface. It addresses the lack of a structured and flexible environment for running data integration and downstream analyses in sequence for healthcare applications. mmid wraps several algorithms for multi-modal decomposition, prediction, and clustering in a single package; these can be combined with a single command and appropriate configuration files, which facilitates the reproducibility and transferability of studies involving heterogeneous health data sources. We use personalised cardiovascular risk prediction as a showcase to highlight the value of a composite pipeline for the proper treatment and analysis of complex multi-modal data. Specifically, we applied mmid in a real application scenario involving cardiac magnetic resonance imaging, electrocardiogram, and polygenic risk score data from the UK Biobank. We showed that the three modalities captured joint and individual information that could be used to (1) identify cardiovascular disease early, before clinical manifestations with cardiological relevance, and (2) do so better than any single data source alone. Moreover, mmid made it possible to impute partially observed data modalities without considerable loss of performance in downstream disease prediction, demonstrating its relevance for real-world health analytics applications, which are often characterised by missing data.
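To make the idea of a configuration-driven pipeline that imputes a partially observed modality and then fuses all modalities more concrete, the following is a minimal sketch. It does not use mmid's actual API, which is not described here: the names `config`, `impute`, `fuse`, and `run_pipeline`, the toy data, and the modality keys are all hypothetical. Feature-wise mean imputation and simple concatenation (early fusion) stand in for the decomposition-based methods the package wraps.

```python
from statistics import fmean

# Hypothetical toy data: one dict per subject, one feature vector per modality.
# None marks a modality that was not observed for that subject.
subjects = [
    {"cmr": [1.0, 2.0], "ecg": [0.5], "prs": [0.25]},
    {"cmr": None,       "ecg": [1.5], "prs": [0.75]},
    {"cmr": [3.0, 4.0], "ecg": [2.5], "prs": None},
]

# Hypothetical configuration: modalities to use and the ordered steps to run.
config = {"modalities": ["cmr", "ecg", "prs"], "steps": ["impute", "fuse"]}


def impute(data, modalities):
    """Fill a missing modality with the feature-wise mean of observed subjects."""
    out = [dict(s) for s in data]  # copy so the input is left untouched
    for m in modalities:
        observed = [s[m] for s in data if s[m] is not None]
        mean_vec = [fmean(col) for col in zip(*observed)]
        for s in out:
            if s[m] is None:
                s[m] = list(mean_vec)
    return out


def fuse(data, modalities):
    """Early fusion: concatenate the modality vectors into one row per subject."""
    return [[x for m in modalities for x in s[m]] for s in data]


# Registry mapping step names in the configuration to functions.
STEPS = {"impute": impute, "fuse": fuse}


def run_pipeline(data, cfg):
    """Apply the configured steps in order and return the final result."""
    for name in cfg["steps"]:
        data = STEPS[name](data, cfg["modalities"])
    return data


fused = run_pipeline(subjects, config)
```

Here `fused` holds one concatenated feature row per subject, which a downstream classifier, survival model, or clustering algorithm could consume. Keeping the step order and modality list in a configuration object, rather than in code, is one way to make such an analysis easier to rerun and share.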