Learning to Detect Baked Goods with Limited Supervision

Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this monitoring can reduce labor costs, improve accuracy, and streamline operations. We propose to automate it with an object detection model that identifies baked goods from images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer flexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenges of deploying computer vision in industrial settings, where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two workflows for training an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task because of its favorable speed-accuracy trade-off. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91. Fine-tuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining the two workflows yields a model that surpasses our fully supervised baseline under non-ideal deployment conditions, despite relying only on image-level supervision.
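To illustrate the first workflow, the following is a minimal sketch of how class-agnostic box proposals could be combined with an image-level label to produce detection training labels. It is not the authors' implementation. It assumes that each training image shows a single class of baked goods, that the proposals from the two open-vocabulary detectors are already available as scored pixel boxes, and that duplicates are removed by greedy non-maximum suppression. The function names `merge_proposals` and `to_yolo_lines` and the IoU threshold of 0.5 are hypothetical. The only fixed convention is the YOLO text label format, `class x_center y_center width height`, with coordinates normalized to the image size.

```python
from typing import List, Tuple

# Axis-aligned box in pixels: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_proposals(
    scored_boxes: List[Tuple[float, Box]], iou_thresh: float = 0.5
) -> List[Box]:
    """Greedy NMS over proposals pooled from several detectors.

    Assumption: the threshold and the merging rule are illustrative only.
    """
    kept: List[Box] = []
    # Visit proposals from highest to lowest confidence.
    for _, box in sorted(scored_boxes, key=lambda sb: sb[0], reverse=True):
        if all(iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept


def to_yolo_lines(
    boxes: List[Box], class_id: int, img_w: int, img_h: int
) -> List[str]:
    """Assign the image-level class to every box and emit YOLO label lines."""
    lines = []
    for x1, y1, x2, y2 in boxes:
        cx = (x1 + x2) / 2 / img_w
        cy = (y1 + y2) / 2 / img_h
        w = (x2 - x1) / img_w
        h = (y2 - y1) / img_h
        lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines


# Hypothetical proposals pooled from two detectors for one 400x200 image.
proposals = [
    (0.9, (10.0, 10.0, 110.0, 110.0)),
    (0.8, (12.0, 12.0, 112.0, 112.0)),  # near-duplicate of the first box
    (0.7, (200.0, 50.0, 300.0, 150.0)),
]
kept = merge_proposals(proposals)
labels = to_yolo_lines(kept, class_id=3, img_w=400, img_h=200)
```

With one such label file per image, a detector like YOLOv11 can be trained without any manually drawn boxes. The single-class assumption matters here: an image containing several classes would need an additional step to decide which proposal receives which label.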
