The You Only Look Once (YOLO) series of detectors has established itself as an efficient and practical tool. However, its reliance on predefined and trained object categories limits its applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and a region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels at detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP at 52.0 FPS on a V100 GPU, outperforming many state-of-the-art methods in both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
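To make the region-text contrastive loss concrete: the idea is to score each region feature against the text embeddings of all candidate categories and pull the region toward its matched text. The sketch below is a minimal, hypothetical InfoNCE-style formulation with cosine similarity and a temperature, not the paper's exact implementation; all function and variable names are illustrative assumptions.

```python
import numpy as np

def region_text_contrastive_loss(region_emb, text_emb, labels, tau=0.1):
    """Hypothetical sketch of a region-text contrastive loss.

    region_emb: (num_regions, d) region features from the detector.
    text_emb:   (num_texts, d) category embeddings from a text encoder.
    labels:     (num_regions,) index of the matched text per region.
    tau:        temperature scaling the cosine-similarity logits.
    """
    # L2-normalize so the dot product is cosine similarity.
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = r @ t.T / tau                       # (num_regions, num_texts)

    # Softmax cross-entropy against the matched text index per region.
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy usage: 4 regions, 3 category texts, 8-dim embeddings.
rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))
texts = rng.normal(size=(3, 8))
labels = np.array([0, 2, 1, 0])
loss = region_text_contrastive_loss(regions, texts, labels)
```

Driving this loss toward zero aligns each region embedding with the text embedding of its ground-truth category, which is what lets the detector score arbitrary, user-provided vocabularies at inference time.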