Comparing YOLOv8 and Mask R-CNN for object segmentation in complex orchard environments

Instance segmentation, an important image processing operation for automation in agriculture, precisely delineates individual objects of interest within images and so provides foundational information for automated or robotic tasks such as selective harvesting and precision pruning. This study compares the one-stage YOLOv8 and the two-stage Mask R-CNN machine learning models for instance segmentation under varying orchard conditions across two datasets. Dataset 1, collected in the dormant season, includes images of dormant apple trees, which were used to train multi-object segmentation models delineating tree branches and trunks. Dataset 2, collected in the early growing season, includes images of apple tree canopies with green foliage and immature (green) apples (also called fruitlets), which were used to train single-object segmentation models delineating only the immature green apples. YOLOv8 outperformed Mask R-CNN on both datasets at a confidence threshold of 0.5, with a precision of at least 0.90 and a recall of at least 0.95. Specifically, for Dataset 1, YOLOv8 achieved a precision of 0.90 and a recall of 0.95 across all classes, whereas Mask R-CNN achieved a precision of 0.81 and a recall of 0.81. For Dataset 2, YOLOv8 achieved a precision of 0.93 and a recall of 0.97, whereas Mask R-CNN, in this single-class scenario, achieved a precision of 0.85 and a recall of 0.88. Additionally, the inference times for YOLOv8 were 10.9 ms for multi-class segmentation (Dataset 1) and 7.8 ms for single-class segmentation (Dataset 2), compared with 15.6 ms and 12.8 ms, respectively, for Mask R-CNN.
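
For readers less familiar with the metrics quoted above, the following is a minimal sketch of how precision and recall follow from detection counts, and of how the reported inference times compare as a ratio. The function names and the example counts are hypothetical and chosen only for illustration; they are not taken from the study, and the study's own evaluation pipeline may differ. Only the four inference times are from the abstract.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive,
    and false-negative counts.

    precision = TP / (TP + FP), recall = TP / (TP + FN).
    A zero denominator yields 0.0 rather than raising an error.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for illustration only (NOT data from the study).
p, r = precision_recall(tp=90, fp=10, fn=5)
print(f"precision={p:.2f}, recall={r:.2f}")

# Inference times in milliseconds, as reported in the abstract.
reported_ms = {
    "Dataset 1": {"YOLOv8": 10.9, "Mask R-CNN": 15.6},
    "Dataset 2": {"YOLOv8": 7.8, "Mask R-CNN": 12.8},
}


def speedup(times: dict[str, float]) -> float:
    """Ratio of Mask R-CNN inference time to YOLOv8 inference time."""
    return times["Mask R-CNN"] / times["YOLOv8"]


for name, times in reported_ms.items():
    print(f"{name}: Mask R-CNN takes {speedup(times):.2f}x the YOLOv8 time")
```

The ratio is a derived figure, not one stated in the abstract; it is shown only to make the timing comparison easier to read.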
