Purpose: Federated training is often hindered by heterogeneous datasets resulting from divergent data storage options, inconsistent naming schemes, varied annotation procedures, and disparities in label quality. This is particularly evident in emerging multi-modal learning paradigms, where dataset harmonization, including a uniform data representation and filtering options, is of paramount importance. Methods: DICOM structured reports enable the standardized linkage of arbitrary information beyond the imaging domain and can be used within Python deep learning pipelines with highdicom. Building on this, we developed an open platform for data integration with interactive filtering capabilities that simplifies the process of assembling multi-modal datasets. Results: In this study, we extend our prior work by showing the platform's applicability to additional and more diverse data types, and by streamlining datasets for federated training within an established consortium of eight university hospitals in Germany. We demonstrate its concurrent filtering ability by creating harmonized multi-modal datasets across all locations for predicting the outcome after minimally invasive heart valve replacement. The data include DICOM data (i.e., computed tomography images and electrocardiography scans), annotations (i.e., calcification segmentations, point sets, and pacemaker dependency), and metadata (i.e., prosthesis and diagnoses). Conclusion: Structured reports bridge the traditional gap between imaging systems and information systems. Using the inherent DICOM reference system, arbitrary data types can be queried concurrently to create meaningful cohorts for clinical studies. The graphical interface as well as example structured report templates will be made publicly available.
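As a rough illustration of what concurrent filtering over linked data types can look like, the following sketch groups items by a shared study identifier and keeps only the studies that carry every required data type. It is not the platform's implementation and does not use highdicom. The `Record` class, the `build_cohort` function, the kind labels ("CT", "ECG", "SEG"), and the example UIDs are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One item (image, annotation, or metadata report) tied to a study."""
    study_uid: str  # shared identifier that links items of one study (hypothetical values)
    kind: str       # hypothetical label for the data type, e.g. "CT", "ECG", "SEG"


def build_cohort(records, required_kinds):
    """Return the sorted study UIDs that carry every required kind of item."""
    kinds_by_study = {}
    for rec in records:
        # Collect the set of data types available for each study.
        kinds_by_study.setdefault(rec.study_uid, set()).add(rec.kind)
    required = set(required_kinds)
    # A study qualifies only if all required kinds are present at once.
    return sorted(uid for uid, kinds in kinds_by_study.items() if required <= kinds)


# Made-up example data: only study "1.2.3" has all three data types.
records = [
    Record("1.2.3", "CT"), Record("1.2.3", "ECG"), Record("1.2.3", "SEG"),
    Record("1.2.4", "CT"), Record("1.2.4", "SEG"),
    Record("1.2.5", "CT"), Record("1.2.5", "ECG"),
]
cohort = build_cohort(records, ["CT", "ECG", "SEG"])
```

In a real pipeline the identifiers and data types would presumably come from the DICOM references stored in the structured reports rather than from hand-written records. The set-inclusion test is only one simple way to express "all filters must hold for the same study".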