Creating a dataset for training supervised machine learning algorithms can be a demanding task. This is especially true for medical image segmentation, since the task usually requires one or more specialists for image annotation, and creating ground truth labels for a single image can take up to several hours. In addition, it is paramount that the annotated samples adequately represent the different conditions that might affect the imaged tissue, as well as possible changes in the image acquisition process. This can only be achieved by considering both samples that are typical of the dataset and atypical, or even outlier, samples. We introduce a new sampling methodology for selecting relevant images from a larger non-annotated dataset in a way that gives even consideration to prototypical and atypical samples. The methodology involves generating a uniform grid from a feature space representing the samples, which is then used for randomly drawing relevant images. The selected images provide a uniform cover of the original dataset, and thus define a heterogeneous set of images that can be annotated and used for training supervised segmentation algorithms. We demonstrate the methodology with a case example, creating a dataset containing a representative set of blood vessel microscopy images selected from a larger dataset containing thousands of images.
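The grid-based selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how features are computed or how the grid is built, so the function name `grid_sample`, the fixed `cell_size` parameter, and the rule of drawing exactly one image per occupied cell are all assumptions made for illustration.

```python
import numpy as np


def grid_sample(features, cell_size, rng=None):
    """Draw one random sample index from every occupied grid cell.

    Hypothetical helper: `features` is an (n_samples, n_dims) array of
    per-image feature vectors, and `cell_size` is an assumed grid spacing.
    """
    rng = np.random.default_rng(rng)
    features = np.asarray(features, dtype=float)

    # Map each point to the integer coordinates of the cell that contains it.
    cells = np.floor((features - features.min(axis=0)) / cell_size).astype(int)

    # Group points by cell: `inverse` holds each point's occupied-cell id.
    _, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.ravel()

    # One random draw per occupied cell, so a sparse cell holding a single
    # outlier contributes as much as a crowded cell of typical samples.
    selected = [
        rng.choice(np.flatnonzero(inverse == cell_id))
        for cell_id in range(inverse.max() + 1)
    ]
    return np.array(sorted(selected))
```

Calling `grid_sample(features, cell_size=0.5, rng=0)` on an array of feature vectors returns the indices of the images to annotate. Because every occupied cell contributes the same number of draws, dense regions of the feature space are not over-represented relative to sparse ones, which is the property the abstract refers to as a uniform cover.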