Open-vocabulary detection (OVD) aims to detect objects beyond a predefined set of categories. As a pioneering model incorporating the YOLO series into OVD, YOLO-World is well suited to scenarios that prioritize speed and efficiency. However, its performance is hindered by its neck feature fusion mechanism, which incurs quadratic complexity and limits the guided receptive fields. To address these limitations, we present Mamba-YOLO-World, a novel YOLO-based OVD model that employs the proposed MambaFusion Path Aggregation Network (MambaFusion-PAN) as its neck architecture. Specifically, we introduce an innovative State Space Model-based feature fusion mechanism, consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm, with linear complexity and globally guided receptive fields. It leverages multi-modal input sequences and Mamba hidden states to guide the selective scanning process. Experiments demonstrate that our model outperforms the original YOLO-World on the COCO and LVIS benchmarks in both zero-shot and fine-tuning settings while maintaining comparable parameters and FLOPs. Additionally, it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.
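To make the linear-complexity claim concrete, the following is a minimal, generic sketch of a guided selective scan in the Mamba/SSM style: a single recurrence pass over the sequence (O(L) in sequence length, unlike quadratic cross-attention), where a second "guide" sequence modulates the input-dependent step size and projections. This is an illustrative toy with random stand-in weights, not the paper's actual Parallel-Guided or Serial-Guided algorithm; the way the guide is folded into the input is an assumption for demonstration only.

```python
import numpy as np

def guided_selective_scan(x, guide, d_state=4, seed=0):
    """Toy guided selective scan over 1-D sequences x and guide (length L).

    Generic diagonal SSM recurrence: h_t = Ad_t * h_{t-1} + Bd_t * x_t,
    y_t = C . h_t, with input-dependent (selective) discretization.
    Runs in a single O(L) loop. Weights are random placeholders, NOT
    the learned parameters of Mamba-YOLO-World (assumption).
    """
    rng = np.random.default_rng(seed)
    A = -np.exp(rng.standard_normal(d_state))   # stable (negative) diagonal state matrix
    W_B = rng.standard_normal(d_state) * 0.1    # stand-in input projection
    W_C = rng.standard_normal(d_state) * 0.1    # stand-in output projection
    w_dt = 0.1                                  # stand-in step-size weight

    h = np.zeros(d_state)
    y = np.zeros(len(x), dtype=float)
    for t in range(len(x)):
        u = x[t] + guide[t]                     # guide folded into the input (assumption)
        dt = np.log1p(np.exp(w_dt * u))         # softplus step size -> "selective" gating
        Ad = np.exp(dt * A)                     # zero-order-hold discretized A
        Bd = (Ad - 1.0) / A * (W_B * u)         # matching discretized B
        h = Ad * h + Bd * x[t]                  # O(1) state update per token
        y[t] = W_C @ h                          # readout
    return y

visual = np.ones(8)        # stand-in visual token sequence
text = np.zeros(8)         # stand-in text guidance sequence
out = guided_selective_scan(visual, text)
```

Because the hidden state `h` is a fixed-size summary carried along the scan, each token costs constant work, which is the structural reason an SSM-based fusion neck can replace quadratic-cost attention while still propagating guidance globally along the sequence.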