In deep learning, achieving high performance on image classification tasks requires diverse training sets. However, the current best practice – maximizing dataset size and class balance – does not guarantee dataset diversity. We hypothesized that, for a given model architecture, model performance can be improved by maximizing diversity more directly. To test this hypothesis, we introduce a comprehensive framework of diversity measures from ecology that generalizes familiar quantities like Shannon entropy by accounting for similarities among images. (Size and class balance emerge as special cases.) Analyzing thousands of subsets from seven medical datasets showed that the best correlates of performance were not size or class balance but $A$ ("big alpha"), a set of generalized entropy measures interpreted as the effective number of image-class pairs in the dataset, after accounting for image similarities. One of these, $A_0$, explained 67% of the variance in balanced accuracy, vs. 54% for class balance and just 39% for size. The best pair of measures was size-plus-$A_1$ (79%), which outperformed size-plus-class-balance (74%). Subsets with the largest $A_0$ performed up to 16% better than those with the largest size (median improvement, 8%). We propose maximizing $A$ as a way to improve deep learning performance in medical imaging.
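To make the "effective number" interpretation concrete: the diversity measures described here generalize Hill numbers (exponentials of Rényi entropies) by weighting each item's abundance by its similarity to the rest of the dataset. Below is a minimal sketch of that family, assuming the standard similarity-sensitive formulation from ecology (Leinster & Cobbold); the function name and signature are illustrative, not the authors' code. With the identity similarity matrix it reduces to classical Hill numbers, where order $q=1$ recovers the exponential of Shannon entropy.

```python
import numpy as np

def diversity(p, Z=None, q=1.0):
    """Similarity-sensitive diversity of order q.

    p : relative abundances (nonnegative, summing to 1).
    Z : pairwise similarity matrix with entries in [0, 1];
        identity (the default) yields classical Hill numbers.
    Returns the 'effective number' of distinct items: n for a
    uniform distribution over n mutually dissimilar items, and
    1 when all items are completely similar.
    """
    p = np.asarray(p, dtype=float)
    if Z is None:
        Z = np.eye(len(p))
    Zp = Z @ p                  # "ordinariness" of each item
    mask = p > 0                # ignore absent items
    if np.isclose(q, 1.0):      # limit q -> 1: exp of similarity-sensitive Shannon entropy
        return float(np.exp(-np.sum(p[mask] * np.log(Zp[mask]))))
    return float(np.sum(p[mask] * Zp[mask] ** (q - 1)) ** (1.0 / (1.0 - q)))
```

For a uniform distribution over four mutually dissimilar items, `diversity` returns 4 at every order $q$; if the similarity matrix is all ones (every item identical), it returns 1, illustrating how similarity deflates the effective count.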