The need for large amounts of labeled data to train deep models, especially in medical imaging, creates an implementation bottleneck in resource-constrained settings. In INSITE (labelINg medical imageS usIng submodular funcTions and sEmi-supervised data programming), we apply informed subset selection to identify a small number of the most representative or diverse images from a very large pool of unlabeled data; these images are subsequently annotated by a domain expert. The newly annotated images are then used as exemplars to develop several data-programming-driven labeling functions. Given an unlabeled image as input, each labeling function outputs a predicted label and a similarity score. A label aggregator function then reconciles the outputs of these labeling functions to assign a final predicted label to each unlabeled data point. We demonstrate that informed subset selection followed by semi-supervised data programming, using the selected images as exemplars, performs better than other state-of-the-art semi-supervised methods. Further, we demonstrate for the first time that this can be achieved with only a small set of images used as exemplars.
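The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: a greedy facility-location objective stands in for the submodular subset selection, each labeling function is built from a single exemplar and scores inputs by cosine similarity, the aggregator is a similarity-weighted vote, and the "images" are synthetic 2-D feature vectors whose ground-truth labels play the role of the domain expert. All function names (`select_exemplars`, `make_labeling_function`, `aggregate_labels`, `predict`) are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def select_exemplars(features, budget):
    """Greedy maximization of a facility-location function (assumed objective).

    f(S) = sum_i max_{j in S} sim(i, j), with similarities clipped at zero so
    that f stays monotone submodular.
    """
    sim = np.clip(cosine_similarity(features, features), 0.0, None)
    n = sim.shape[0]
    coverage = np.zeros(n)              # best similarity of each point to S
    available = np.ones(n, dtype=bool)  # candidates not yet selected
    selected = []
    for _ in range(min(budget, n)):
        # Marginal gain of adding each candidate column j to the current set.
        gains = np.maximum(sim, coverage[:, None]).sum(axis=0) - coverage.sum()
        gains[~available] = -np.inf
        best = int(np.argmax(gains))
        selected.append(best)
        available[best] = False
        coverage = np.maximum(coverage, sim[:, best])
    return selected


def make_labeling_function(exemplar_feature, exemplar_label):
    """Build a labeling function from one annotated exemplar (assumed design)."""
    def labeling_function(x):
        score = cosine_similarity(x[None, :], exemplar_feature[None, :])[0, 0]
        return exemplar_label, float(score)  # (predicted label, similarity)
    return labeling_function


def aggregate_labels(outputs, num_classes):
    """Similarity-weighted vote over (label, score) pairs (assumed aggregator)."""
    votes = np.zeros(num_classes)
    for label, score in outputs:
        votes[label] += max(score, 0.0)
    return int(np.argmax(votes))


def predict(x, labeling_functions, num_classes):
    return aggregate_labels([lf(x) for lf in labeling_functions], num_classes)


# Synthetic stand-in for image features: two well-separated classes.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
true_labels = np.repeat([0, 1], 20)
pool = centers[true_labels] + 0.02 * rng.normal(size=(40, 2))

chosen = select_exemplars(pool, budget=4)
# The ground-truth label stands in for the domain expert's annotation.
lfs = [make_labeling_function(pool[i], int(true_labels[i])) for i in chosen]
predictions = np.array([predict(x, lfs, num_classes=2) for x in pool])
print("exemplars:", chosen)
print("accuracy:", float((predictions == true_labels).mean()))
```

In this toy setting the greedy selection picks exemplars from both clusters before adding redundant ones, which is the behavior the subset-selection step relies on. A real system would replace the 2-D vectors with learned image embeddings and would likely use a trained label model in place of the simple weighted vote.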