Open-vocabulary detection (OVD) aims to detect objects beyond a predefined set of categories. As a pioneering model incorporating the YOLO series into OVD, YOLO-World is well-suited for scenarios prioritizing speed and efficiency. However, its performance is hindered by its neck feature fusion mechanism, which incurs quadratic complexity and limits the guided receptive fields. To address these limitations, we present Mamba-YOLO-World, a novel YOLO-based OVD model employing the proposed MambaFusion Path Aggregation Network (MambaFusion-PAN) as its neck architecture. Specifically, we introduce an innovative State Space Model-based feature fusion mechanism, consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm, with linear complexity and globally guided receptive fields. It leverages multi-modal input sequences and Mamba hidden states to guide the selective scanning process. Experiments demonstrate that our model outperforms the original YOLO-World on the COCO and LVIS benchmarks in both zero-shot and fine-tuning settings while maintaining comparable parameters and FLOPs. Additionally, it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.
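To make the linear-complexity claim concrete, the sketch below shows a generic 1-D selective-scan recurrence in which a second (e.g. text-derived) sequence modulates the input at each step. This is a minimal illustration of the idea, not the paper's actual MambaFusion-PAN formulation: the function name `selective_scan`, the `guide` argument, and the simplified input-independent parameters `A`, `B`, `C` are all assumptions made for clarity.

```python
import numpy as np

def selective_scan(x, guide, A, B, C):
    """Minimal 1-D selective-scan sketch, linear in sequence length T.

    x:     (T, D) visual token sequence
    guide: (T, D) guiding sequence (hypothetical stand-in for the
           text-derived inputs that steer the scan)
    A, B, C: (D,) simplified, input-independent parameters

    The hidden state h carries context forward across the whole
    sequence, so each output token is conditioned on all earlier
    tokens at O(T) cost, unlike O(T^2) pairwise attention.
    """
    T, D = x.shape
    h = np.zeros(D)              # Mamba-style hidden state
    y = np.empty_like(x)
    for t in range(T):
        # The guide modulates what enters the state: a toy analogue of
        # guided selection, not the paper's exact mechanism.
        h = A * h + B * (x[t] * guide[t])
        y[t] = C * h
    return y
```

With `A = 1` the state simply accumulates the (guided) inputs, so the output at step `t` depends on every token up to `t`, illustrating the globally guided receptive field at linear cost.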