Deep learning has enabled groundbreaking achievements in recent years: from breast cancer detection to protein folding, deep learning algorithms have been at the core of major advances. However, these methods are increasingly data-hungry, particularly for labeled data, which is scarce, and especially so in the medical context. In this work, we show that active learning can be very effective in data-scarce settings, where labeled data is hard to obtain or the annotation budget is very limited. We compare several selection criteria (BALD, MeanSTD, and MaxEntropy) on the ISIC 2016 dataset. We also explore the effect of the acquired pool size on the model's performance. Our results suggest that uncertainty is useful for the melanoma detection task and confirm the hypothesis of the author of the paper of interest that \textit{BALD} performs better on average than the other acquisition functions. Our extended analyses, however, reveal that all acquisition functions perform poorly on the positive (cancerous) samples, suggesting that the models exploit class imbalance, an issue that could be crucial in real-world settings. We finish by suggesting directions for future work that would improve on the current study. The code of our implementation is open-sourced at \url{https://github.com/bonaventuredossou/ece526_course_project}.
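For readers unfamiliar with the three selection criteria named above, the following is a minimal sketch of how they are commonly computed from stochastic forward passes (for example, Monte Carlo dropout). It is an illustration under stated assumptions, not the implementation in the linked repository: the array layout `(T, N, C)` (T stochastic passes, N pool samples, C classes), the function names, and the `select_top_k` helper are all hypothetical.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); an illustrative choice, not taken from the paper


def max_entropy(probs):
    """Entropy of the mean predictive distribution, one score per pool sample.

    probs: array of shape (T, N, C) holding class probabilities from
    T stochastic forward passes (assumed layout).
    """
    mean_p = probs.mean(axis=0)  # (N, C)
    return -(mean_p * np.log(mean_p + EPS)).sum(axis=-1)


def bald(probs):
    """Mutual information between predictions and model parameters.

    Computed as the entropy of the mean prediction minus the mean
    entropy of the individual stochastic predictions.
    """
    per_pass_entropy = -(probs * np.log(probs + EPS)).sum(axis=-1)  # (T, N)
    return max_entropy(probs) - per_pass_entropy.mean(axis=0)


def mean_std(probs):
    """Standard deviation across passes, averaged over classes."""
    return probs.std(axis=0).mean(axis=-1)


def select_top_k(scores, k):
    """Indices of the k pool samples with the highest acquisition score."""
    return np.argsort(-scores)[:k]
```

Under these definitions, MaxEntropy scores a sample highly whenever the averaged prediction is uncertain, whereas BALD and MeanSTD score it highly only when the stochastic passes disagree with each other.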