Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications due to their high computational burden and complex deployment. To address this issue, this paper proposes a lightweight framework called Decoupled OSOD (DOSOD), a practical and highly efficient solution for real-time OSOD in robotic systems. Specifically, DOSOD builds upon the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A multilayer perceptron (MLP) adaptor is developed to transform text embeddings extracted by the VLM into a joint space, within which the detector learns region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. During the testing phase, DOSOD operates like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection. Compared to the baseline YOLO-World, the proposed DOSOD significantly enhances real-time performance while maintaining comparable accuracy. The lightweight DOSOD-S model achieves a Fixed AP of $26.7\%$, compared to $26.2\%$ for YOLO-World-v1-S and $22.7\%$ for YOLO-World-v2-S, using similar backbones on the LVIS minival dataset, while its FPS is $57.1\%$ higher than YOLO-World-v1-S and $29.6\%$ higher than YOLO-World-v2-S. We further demonstrate that DOSOD facilitates deployment on edge devices. Code and models are publicly available at https://github.com/D-Robotics-AI-Lab/DOSOD.
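The decoupled alignment described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer widths, embedding dimensions, and the two-layer ReLU form of the adaptor are illustrative assumptions; only the overall idea (an MLP maps VLM text embeddings into a joint space, where they score class-agnostic region features by similarity, like a closed-set classification head at test time) follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_adaptor(text_emb, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping VLM text embeddings to the joint space."""
    h = np.maximum(text_emb @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2

# Assumed dimensions: 512-d text embeddings projected into a 256-d joint space.
d_text, d_hidden, d_joint, n_classes, n_regions = 512, 512, 256, 3, 5
w1, b1 = rng.normal(size=(d_text, d_hidden)) * 0.02, np.zeros(d_hidden)
w2, b2 = rng.normal(size=(d_hidden, d_joint)) * 0.02, np.zeros(d_joint)

text_emb = rng.normal(size=(n_classes, d_text))      # one embedding per text prompt
region_feat = rng.normal(size=(n_regions, d_joint))  # class-agnostic proposal features

# Align the two modalities directly in the joint space via cosine similarity.
# At test time the adapted text embeddings are fixed, so this reduces to the
# linear classification head of a conventional closed-set detector.
t = mlp_adaptor(text_emb, w1, b1, w2, b2)
t /= np.linalg.norm(t, axis=1, keepdims=True)
r = region_feat / np.linalg.norm(region_feat, axis=1, keepdims=True)
scores = r @ t.T  # (n_regions, n_classes) similarity logits
```

Because no cross-attention or other feature interaction occurs between modalities, the text branch can be computed once per vocabulary and cached, which is what makes the design attractive for real-time and edge deployment.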