Patchwork Learning: A Paradigm Towards Integrative Analysis across Diverse Biomedical Data Sources

Machine learning (ML) in healthcare presents numerous opportunities for enhancing patient care, population health, and healthcare providers' workflows. However, the real-world clinical and cost benefits remain limited due to challenges in data privacy, heterogeneous data sources, and the inability to fully leverage multiple data modalities. In this perspective paper, we introduce "patchwork learning" (PL), a novel paradigm that addresses these limitations by integrating information from disparate datasets composed of different data modalities (e.g., clinical free-text, medical images, omics) and distributed across separate and secure sites. PL allows the simultaneous utilization of complementary data sources while preserving data privacy, enabling the development of more holistic and generalizable ML models. We present the concept of patchwork learning and its current implementations in healthcare, exploring the potential opportunities and applicable data sources for addressing various healthcare challenges. PL leverages bridging modalities or overlapping feature spaces across sites to facilitate information sharing and impute missing data, thereby addressing related prediction tasks. We discuss the challenges associated with PL, many of which are shared by federated and multimodal learning, and provide recommendations for future research in this field. By offering a more comprehensive approach to healthcare data integration, patchwork learning has the potential to revolutionize the clinical applicability of ML models. This paradigm promises to strike a balance between personalization and generalizability, ultimately enhancing patient experiences, improving population health, and optimizing healthcare providers' workflows.
