Human-in-the-Loop Segmentation of Multi-species Coral Imagery

From arXiv. Journal article preprint of the extended paper, 30 pages, 11 figures. The original conference paper (v2) was accepted at the CVPR 2024 3rd Workshop on Learning with Limited Labelled Data for Image and Video Understanding (L3D-IVU).

Marine surveys by robotic underwater and surface vehicles produce substantial quantities of coral reef imagery; however, labeling these images is expensive and time-consuming for domain experts. Point label propagation is a technique that uses existing images labeled with sparse points to create augmented ground truth data, which can then be used to train a semantic segmentation model. In this work, we show that recent advances in large foundation models facilitate the creation of augmented ground truth masks using only features extracted by the denoised version of the DINOv2 foundation model and K-Nearest Neighbors (KNN), without any pre-training. For images with extremely sparse labels, we present a labeling method based on human-in-the-loop principles, which greatly enhances annotation efficiency: with 5 point labels per image, our human-in-the-loop method outperforms the prior state-of-the-art by 14.2% for pixel accuracy and 19.7% for mIoU, and with 10 point labels by 8.9% and 18.3%, respectively. When human-in-the-loop labeling is not available, using the denoised DINOv2 features with KNN still improves on the prior state-of-the-art by 2.7% for pixel accuracy and 5.8% for mIoU (5 grid points). On the semantic segmentation task, we outperform the prior state-of-the-art by 8.8% for pixel accuracy and by 13.5% for mIoU when only 5 point labels are used for point label propagation. Additionally, we perform a comprehensive study of how the point label placement style and the number of points affect point label propagation quality, and make several recommendations for improving the efficiency of labeling images with points.
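
To make the idea of point label propagation with KNN concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a per-pixel feature map is already available (in the paper these come from the denoised DINOv2 model; here they are replaced by synthetic vectors), and the function names `propagate_point_labels`, `pixel_accuracy`, and `mean_iou` are hypothetical names chosen for illustration. Each pixel receives the majority label among its k most similar labeled points, and the resulting mask is scored with the two metrics quoted above.

```python
import numpy as np


def propagate_point_labels(features, point_rows, point_cols, point_labels, k=1):
    """Give every pixel the majority label of its k most similar labeled points.

    features: (H, W, D) per-pixel feature vectors. In this sketch they are a
    stand-in for upsampled foundation-model features (an assumption).
    """
    h, w, d = features.shape
    flat = features.reshape(-1, d).astype(np.float64)
    # L2-normalise so the dot product below is cosine similarity.
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)
    labels = np.asarray(point_labels)
    # Row-major index of each labeled point in the flattened image.
    anchors = flat[np.asarray(point_rows) * w + np.asarray(point_cols)]
    sims = flat @ anchors.T                       # (H*W, P) similarities
    k = min(k, len(labels))
    nearest = np.argsort(-sims, axis=1)[:, :k]    # k most similar points
    votes = labels[nearest]                       # (H*W, k) candidate labels
    n_classes = int(labels.max()) + 1
    counts = np.zeros((flat.shape[0], n_classes), dtype=np.int64)
    rows = np.arange(flat.shape[0])
    for j in range(k):
        np.add.at(counts, (rows, votes[:, j]), 1)
    # Ties are broken in favour of the lowest class index.
    return counts.argmax(axis=1).reshape(h, w)


def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float((pred == truth).mean())


def mean_iou(pred, truth, n_classes):
    """Mean intersection-over-union over classes present in pred or truth."""
    ious = []
    for c in range(n_classes):
        union = np.logical_or(pred == c, truth == c).sum()
        if union == 0:
            continue  # class absent from both masks; skip it
        inter = np.logical_and(pred == c, truth == c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))


# Toy example: an 8x8 "image" whose left half is class 0 and right half class 1.
rng = np.random.default_rng(0)
truth = np.zeros((8, 8), dtype=np.int64)
truth[:, 4:] = 1
prototypes = np.eye(4)[:2]                        # one feature prototype per class
features = prototypes[truth] + rng.normal(scale=0.01, size=(8, 8, 4))

# Two sparse point labels, one inside each region.
mask = propagate_point_labels(features, [2, 5], [1, 6], [0, 1], k=1)
print("pixel accuracy:", pixel_accuracy(mask, truth))
print("mIoU:", mean_iou(mask, truth, n_classes=2))
```

The dense similarity matrix keeps the sketch short but scales with the number of pixels times the number of labeled points; for full-resolution imagery a dedicated nearest-neighbor index would likely be a better fit.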
