Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications due to their high computational burden and complex deployment. To address this issue, this paper proposes a lightweight framework called Decoupled OSOD (DOSOD), a practical and highly efficient solution for real-time OSOD in robotic systems. Specifically, DOSOD builds upon the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A multilayer perceptron (MLP) adaptor is developed to transform text embeddings extracted by the VLM into a joint space, within which the detector learns region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. During the testing phase, DOSOD operates like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection. Compared to the baseline YOLO-World, the proposed DOSOD significantly enhances real-time performance while maintaining comparable accuracy. The lightweight DOSOD-S model achieves a Fixed AP of $26.7\%$, compared to $26.2\%$ for YOLO-World-v1-S and $22.7\%$ for YOLO-World-v2-S, using similar backbones on the LVIS minival dataset, while its FPS is $57.1\%$ higher than YOLO-World-v1-S and $29.6\%$ higher than YOLO-World-v2-S. We further demonstrate that DOSOD facilitates deployment on edge devices. Code and models are publicly available at https://github.com/D-Robotics-AI-Lab/DOSOD.
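The decoupled alignment described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer widths, embedding dimensions, and the two-layer ReLU form of the adaptor are illustrative assumptions; only the overall idea (an MLP maps VLM text embeddings into a joint space, where they score class-agnostic region features by similarity, like a closed-set classification head at test time) follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_adaptor(text_emb, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping VLM text embeddings to the joint space."""
    h = np.maximum(text_emb @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2

# Assumed dimensions: 512-d text embeddings projected into a 256-d joint space.
d_text, d_hidden, d_joint, n_classes, n_regions = 512, 512, 256, 3, 5
w1, b1 = rng.normal(size=(d_text, d_hidden)) * 0.02, np.zeros(d_hidden)
w2, b2 = rng.normal(size=(d_hidden, d_joint)) * 0.02, np.zeros(d_joint)

text_emb = rng.normal(size=(n_classes, d_text))      # one embedding per text prompt
region_feat = rng.normal(size=(n_regions, d_joint))  # class-agnostic proposal features

# Align the two modalities directly in the joint space via cosine similarity.
# At test time the adapted text embeddings are fixed, so this reduces to the
# linear classification head of a conventional closed-set detector.
t = mlp_adaptor(text_emb, w1, b1, w2, b2)
t /= np.linalg.norm(t, axis=1, keepdims=True)
r = region_feat / np.linalg.norm(region_feat, axis=1, keepdims=True)
scores = r @ t.T  # (n_regions, n_classes) similarity logits
```

Because no cross-attention or other feature interaction occurs between modalities, the text branch can be computed once per vocabulary and cached, which is what makes the design attractive for real-time and edge deployment.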