Creating a dataset for training supervised machine learning algorithms can be a demanding task. This is especially true for medical image segmentation, since one or more specialists are usually required for image annotation, and creating ground truth labels for a single image can take up to several hours. In addition, it is paramount that the annotated samples adequately represent the different conditions that might affect the imaged tissues, as well as possible changes in the image acquisition process. This can only be achieved by considering both samples that are typical of the dataset and samples that are atypical, or even outliers. We introduce a new sampling methodology for selecting relevant images from a large dataset in a way that gives equal consideration to prototypical and atypical samples. The methodology involves generating a uniform grid from a feature space representing the samples, which is then used for randomly drawing relevant images. The selected images provide uniform coverage of the original dataset, and thus define a heterogeneous set of images that can be annotated and used for training supervised segmentation algorithms. As a case study, we create a dataset containing a representative set of blood vessel microscopy images selected from a larger dataset containing thousands of images. The dataset, which we call VessMAP, is being made available online to aid the development of new blood vessel segmentation algorithms.
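The grid-based selection described above can be illustrated with a minimal sketch. The abstract does not specify the features, the grid resolution, or the exact drawing procedure, so everything here is an assumption for illustration: the function name `grid_sample`, the parameter `cells_per_axis`, the min-max binning, and the choice of drawing exactly one sample per occupied cell are hypothetical, not the authors' implementation.

```python
import numpy as np

def grid_sample(features, cells_per_axis=10, rng=None):
    """Draw one random sample from each occupied cell of a uniform grid.

    Hypothetical sketch: `features` is an (n_samples, n_dims) array, and
    the grid spans the bounding box of the data (an assumption).
    """
    rng = np.random.default_rng(rng)
    features = np.asarray(features, dtype=float)

    # Scale each feature axis to [0, 1] over the data's bounding box.
    lo = features.min(axis=0)
    span = features.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against constant features

    # Integer grid coordinates; points on the upper edge go to the last cell.
    cells = np.floor((features - lo) / span * cells_per_axis).astype(int)
    cells = np.clip(cells, 0, cells_per_axis - 1)

    # Label each sample with the id of the occupied cell it falls into.
    _, cell_id = np.unique(cells, axis=0, return_inverse=True)
    cell_id = cell_id.ravel()

    # One uniformly random member per occupied cell, regardless of density.
    picked = [rng.choice(np.flatnonzero(cell_id == k))
              for k in range(cell_id.max() + 1)]
    return np.sort(np.array(picked))

# Toy feature space: a dense cluster of typical samples plus two outliers.
data_rng = np.random.default_rng(0)
dense = data_rng.normal(0.0, 0.05, size=(1000, 2))
outliers = np.array([[1.0, 1.0], [-1.0, 1.0]])
features = np.vstack([dense, outliers])

picked = grid_sample(features, cells_per_axis=8, rng=0)
```

In this toy setting, each outlier occupies its own grid cell and is therefore always selected, while the dense cluster contributes only as many samples as the cells it occupies. Plain uniform random sampling over the 1002 points would instead almost always miss the two outliers.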