Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications due to their high computational burden and complex deployment. To address this issue, this paper proposes a lightweight framework called Decoupled OSOD (DOSOD), a practical and highly efficient solution for real-time OSOD tasks in robotic systems. Specifically, DOSOD builds upon the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor is developed to transform text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. DOSOD operates like a traditional closed-set detector during the testing phase, effectively bridging the gap between closed-set and open-set detection. Compared to the baseline YOLO-World, the proposed DOSOD significantly enhances real-time performance while maintaining comparable accuracy. The lightweight DOSOD-S model achieves a Fixed AP of $26.7\%$, compared to $26.2\%$ for YOLO-World-v1-S and $22.7\%$ for YOLO-World-v2-S, using similar backbones on the LVIS minival dataset. Meanwhile, the FPS of DOSOD-S is $57.1\%$ higher than that of YOLO-World-v1-S and $29.6\%$ higher than that of YOLO-World-v2-S. Furthermore, we demonstrate that the DOSOD model facilitates deployment on edge devices. The codes and models are publicly available at https://github.com/D-Robotics-AI-Lab/DOSOD.
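The decoupled alignment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimensions, MLP depth, and the use of cosine similarity for scoring are assumptions for the sake of the example. The key idea it shows is that text embeddings pass through an MLP adaptor into a joint space, where they are matched against region features of class-agnostic proposals by a simple similarity product, with no cross-modal feature-interaction layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPAdaptor(nn.Module):
    """Hypothetical MLP adaptor: maps VLM text embeddings into a
    joint space shared with detector region features (dims assumed)."""
    def __init__(self, text_dim: int = 512, joint_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, joint_dim),
            nn.ReLU(),
            nn.Linear(joint_dim, joint_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_emb)

adaptor = MLPAdaptor()
text_emb = torch.randn(80, 512)       # embeddings for 80 category prompts (from the VLM)
region_feats = torch.randn(100, 256)  # joint-space features of 100 class-agnostic proposals

# Direct alignment in the joint space: normalized dot-product scores,
# i.e. cosine similarity between each proposal and each category text.
text_joint = F.normalize(adaptor(text_emb), dim=-1)
region_joint = F.normalize(region_feats, dim=-1)
logits = region_joint @ text_joint.T  # (100, 80) per-proposal class scores
```

Because the text side depends only on the prompt vocabulary, the adapted text embeddings can be computed once offline; at test time the detector then behaves like a closed-set detector scoring proposals against a fixed weight matrix, which is what makes edge deployment straightforward.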