Proper quality control (QC) is time consuming when working with large-scale medical imaging datasets, yet necessary, as poor-quality data can lead to erroneous conclusions or poorly trained machine learning models. Most efforts to reduce QC time rely on outlier detection, which cannot capture every instance of algorithm failure. Thus, there is a need to visually inspect every output of data processing pipelines in a scalable manner. We design a QC pipeline that keeps time cost and effort low across a team working with a large database of diffusion-weighted and structural magnetic resonance images. Our proposed method satisfies the following design criteria: 1) a consistent way to perform and manage QC across a team of researchers; 2) quick visualization of preprocessed data that minimizes the effort and time spent on QC without compromising its rigor; and 3) a way to aggregate QC results across pipelines and datasets that can be easily shared. In addition to meeting these design criteria, we describe what a successful output should look like and common modes of algorithm failure for various processing pipelines. Our method reduces the time spent on QC by a factor of more than 20 compared to naively opening outputs in an image viewer, and we demonstrate how it facilitates aggregation and sharing of QC results within a team. While researchers must spend time on robust visual QC of data, there are mechanisms by which the process can be made streamlined and efficient.