Improving the accuracy of automated labeling of specimen images datasets via a confidence-based process

The digitization of natural history collections over the past three decades has unlocked a treasure trove of specimen imagery and metadata. There is great interest in making these data more useful by labeling them with additional trait data, and modern deep learning techniques based on convolutional neural networks (CNNs) and similar architectures show particular promise for reducing the manual labeling required of human experts, making the process much faster and less expensive. However, in most cases the accuracy of these approaches, typically in the range of 80-85%, is too low for the automatic labels to be used reliably. In this paper, we present and validate an approach that can greatly improve this accuracy by examining the confidence that the network has in each generated label and applying a user-defined threshold to reject labels whose confidence falls below a chosen level. We demonstrate that a naive model with 86% initial accuracy can achieve over 95% accuracy (rejecting about 40% of the labels) or over 99% accuracy (rejecting about 65%) when higher confidence thresholds are selected. This gives researchers the flexibility to adapt existing models to the statistical requirements of different types of research, and it has the potential to turn automatic labeling approaches from unusably inaccurate into an invaluable new tool. After validating the approach in a number of ways, we annotate the reproductive state of a large dataset of over 600,000 herbarium specimens. Analysis of the results points to under-investigated correlations and shows general alignment with known trends. By sharing this new dataset alongside this work, we aim to let ecologists gather insights for their own research questions at their chosen point on the accuracy/coverage trade-off.
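The thresholding idea described above can be illustrated with a minimal sketch. It assumes that the confidence of a label is the highest class probability output by the network, which the abstract does not specify. The function name `selective_accuracy` and the toy probabilities are hypothetical and are not taken from the paper's model or data.

```python
def selective_accuracy(probs, labels, threshold):
    """Return (accuracy on kept labels, fraction of labels rejected).

    probs:     list of per-class probability lists, one per specimen
    labels:    list of true class indices, used only for evaluation
    threshold: minimum confidence required to keep a predicted label
    """
    if not labels or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equal in length")
    kept = 0
    correct = 0
    for row, label in zip(probs, labels):
        # Assumption: confidence is the top class probability.
        confidence = max(row)
        if confidence < threshold:
            continue  # reject this label as too uncertain
        kept += 1
        predicted = row.index(confidence)
        correct += int(predicted == label)
    rejected = 1.0 - kept / len(labels)
    # Accuracy is undefined if every label was rejected.
    accuracy = correct / kept if kept else float("nan")
    return accuracy, rejected


# Toy two-class example (made-up numbers, for illustration only).
toy_probs = [
    [0.97, 0.03],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.60, 0.40],
    [0.20, 0.80],
]
toy_labels = [0, 1, 1, 0, 0]

for t in (0.5, 0.7, 0.85):
    acc, rej = selective_accuracy(toy_probs, toy_labels, t)
    print(f"threshold={t:.2f}  accuracy={acc:.2f}  rejected={rej:.0%}")
```

In this toy example, raising the threshold increases the accuracy of the retained labels while a larger share of labels is rejected. This is the same accuracy/coverage trade-off the abstract describes, though the numbers here carry no empirical meaning.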
