In the Detection and Multi-Object Tracking of Sweet Peppers Challenge, we present Track Any Peppers (TAP) - a weakly supervised ensemble technique for sweet pepper tracking. TAP leverages the zero-shot detection capabilities of vision-language foundation models such as Grounding DINO to automatically generate pseudo-labels for sweet peppers in video sequences with minimal human intervention. These pseudo-labels, refined where necessary, are used to train a YOLOv8 segmentation network. To enhance detection accuracy under challenging conditions, we apply pre-processing techniques such as relighting adjustments and filter detections by depth as a post-inference step. For object tracking, we integrate the Matching Anything by Segmenting Anything (MASA) adapter with the BoT-SORT algorithm. Our approach achieves a HOTA score of 80.4%, a MOTA of 66.1%, a recall of 74.0%, and a precision of 90.7%, demonstrating effective tracking of sweet peppers without extensive manual labeling effort. This work highlights the potential of foundation models for efficient and accurate object detection and tracking in agricultural settings.
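As a minimal sketch of the depth-based post-inference filtering mentioned above (the function name, box format, and depth threshold are illustrative assumptions, not taken from the paper), detections whose centre pixel falls beyond a depth cutoff can be discarded as background:

```python
def filter_by_depth(detections, depth_map, max_depth_m=1.5):
    """Keep only detections whose centre-pixel depth is within range.

    detections:  list of (x1, y1, x2, y2, score) boxes in pixel coords.
    depth_map:   2-D list of per-pixel depth values in metres.
    max_depth_m: hypothetical cutoff; the paper does not state its value.
    """
    kept = []
    for x1, y1, x2, y2, score in detections:
        # Centre of the bounding box, clamped implicitly by valid boxes.
        cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
        if depth_map[cy][cx] <= max_depth_m:
            kept.append((x1, y1, x2, y2, score))
    return kept


# Example: a 4x4 depth map where the far corner is beyond the cutoff.
depth = [[0.5] * 4 for _ in range(4)]
depth[3][3] = 3.0
dets = [(0, 0, 2, 2, 0.9),   # centre (1, 1), depth 0.5 -> kept
        (2, 2, 4, 4, 0.8)]   # centre (3, 3), depth 3.0 -> dropped
near = filter_by_depth(dets, depth)
```

A real pipeline would instead sample a robust statistic (e.g. the median depth inside the mask) rather than a single pixel, but the gating logic is the same.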