Patchwork Learning: A Paradigm Towards Integrative Analysis across Diverse Biomedical Data Sources

Machine learning (ML) in healthcare presents numerous opportunities for enhancing patient care, population health, and healthcare providers' workflows. However, the real-world clinical and cost benefits remain limited due to challenges in data privacy, heterogeneous data sources, and the inability to fully leverage multiple data modalities. In this perspective paper, we introduce "patchwork learning" (PL), a novel paradigm that addresses these limitations by integrating information from disparate datasets composed of different data modalities (e.g., clinical free-text, medical images, omics) and distributed across separate and secure sites. PL allows the simultaneous utilization of complementary data sources while preserving data privacy, enabling the development of more holistic and generalizable ML models. We present the concept of patchwork learning and its current implementations in healthcare, exploring the potential opportunities and applicable data sources for addressing various healthcare challenges. PL leverages bridging modalities or overlapping feature spaces across sites to facilitate information sharing and impute missing data, thereby addressing related prediction tasks. We discuss the challenges associated with PL, many of which are shared by federated and multimodal learning, and provide recommendations for future research in this field. By offering a more comprehensive approach to healthcare data integration, patchwork learning has the potential to revolutionize the clinical applicability of ML models. This paradigm promises to strike a balance between personalization and generalizability, ultimately enhancing patient experiences, improving population health, and optimizing healthcare providers' workflows.
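To make the bridging idea above concrete, the following is a minimal sketch, not the paper's method. It assumes two hypothetical sites: site A observes both a bridging modality and a second modality (here called "omics"), while site B observes only the bridging modality. Site A fits a ridge-regularised linear map from the bridge to the second modality, and only the fitted weights are passed to site B, which uses them to impute its missing modality. The data are synthetic, and the function names (fit_bridge_imputer, impute_from_bridge), the linear model, and the ridge penalty are illustrative assumptions.

```python
import numpy as np


def add_bias(features):
    """Append a column of ones so the linear map can learn an intercept."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_bridge_imputer(bridge, target, ridge=1e-3):
    """Fit a ridge-regularised linear map: bridging modality -> target modality.

    Hypothetical helper; uses the closed-form solution (X^T X + ridge*I)^-1 X^T Y.
    """
    X = add_bias(bridge)
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ target)


def impute_from_bridge(bridge, weights):
    """Predict the missing modality at a site that only has the bridge."""
    return add_bias(bridge) @ weights


rng = np.random.default_rng(0)
true_map = rng.normal(size=(4, 3))  # synthetic bridge -> omics relation

# Site A (assumed): observes both the bridging modality and omics.
bridge_a = rng.normal(size=(200, 4))
omics_a = bridge_a @ true_map + 0.01 * rng.normal(size=(200, 3))

# Site B (assumed): observes only the bridging modality.
bridge_b = rng.normal(size=(50, 4))
omics_b_true = bridge_b @ true_map  # kept only to check the sketch

# Only the fitted weights leave site A; the raw records stay local.
weights = fit_bridge_imputer(bridge_a, omics_a)
omics_b_imputed = impute_from_bridge(bridge_b, weights)

max_error = float(np.max(np.abs(omics_b_imputed - omics_b_true)))
print(f"largest imputation error at site B: {max_error:.4f}")
```

Sharing model weights instead of records is not by itself a formal privacy guarantee, and a realistic setting would likely need nonlinear models and more than two sites. The sketch only illustrates how a shared modality can carry information between sites that never pool their data.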
