In computed tomography (CT), machine learning is often used for automated data processing. However, increasing model complexity is accompanied by increasingly large volume datasets, which in turn increases the cost of model training. Unlike most work, which mitigates this by advancing model architectures and training algorithms, we consider the annotation procedure and its effect on model performance. We take the three main virtues of a good training dataset to be label quality, diversity, and completeness. We compare the effects of these virtues on model performance using open medical CT datasets and conclude that quality is more important than diversity early in the labeling process; diversity, in turn, is more important than completeness. Based on this conclusion and additional experiments, we propose a labeling procedure for the segmentation of tomographic images that minimizes labeling effort while maximizing model performance.