Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications due to their high computational burden and complex deployment. To address this issue, this paper proposes a lightweight framework called Decoupled OSOD (DOSOD), a practical and highly efficient solution for real-time OSOD tasks in robotic systems. Specifically, DOSOD builds upon the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor is developed to transform text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. DOSOD operates like a traditional closed-set detector during the testing phase, effectively bridging the gap between closed-set and open-set detection. Compared to the baseline YOLO-World, the proposed DOSOD significantly enhances real-time performance while maintaining comparable accuracy. The lightweight DOSOD-S model achieves a Fixed AP of $26.7\%$, compared to $26.2\%$ for YOLO-World-v1-S and $22.7\%$ for YOLO-World-v2-S, using similar backbones on the LVIS minival dataset. Meanwhile, the FPS of DOSOD-S is $57.1\%$ higher than that of YOLO-World-v1-S and $29.6\%$ higher than that of YOLO-World-v2-S. Furthermore, we demonstrate that the DOSOD model facilitates deployment on edge devices. The codes and models are publicly available at https://github.com/D-Robotics-AI-Lab/DOSOD.
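The decoupled alignment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimensions, MLP depth, and the use of cosine similarity for scoring are assumptions for the sake of the example. The key idea it shows is that text embeddings pass through an MLP adaptor into a joint space, where they are matched against region features of class-agnostic proposals by a simple similarity product, with no cross-modal feature-interaction layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPAdaptor(nn.Module):
    """Hypothetical MLP adaptor: maps VLM text embeddings into a
    joint space shared with detector region features (dims assumed)."""
    def __init__(self, text_dim: int = 512, joint_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, joint_dim),
            nn.ReLU(),
            nn.Linear(joint_dim, joint_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_emb)

adaptor = MLPAdaptor()
text_emb = torch.randn(80, 512)       # embeddings for 80 category prompts (from the VLM)
region_feats = torch.randn(100, 256)  # joint-space features of 100 class-agnostic proposals

# Direct alignment in the joint space: normalized dot-product scores,
# i.e. cosine similarity between each proposal and each category text.
text_joint = F.normalize(adaptor(text_emb), dim=-1)
region_joint = F.normalize(region_feats, dim=-1)
logits = region_joint @ text_joint.T  # (100, 80) per-proposal class scores
```

Because the text side depends only on the prompt vocabulary, the adapted text embeddings can be computed once offline; at test time the detector then behaves like a closed-set detector scoring proposals against a fixed weight matrix, which is what makes edge deployment straightforward.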