Text-guided Open-vocabulary Object Counting (TOOC) aims to estimate the number of objects described by text prompts, which is particularly challenging in dense scenes with large scale variations. Existing TOOC approaches predominantly rely on Transformers, whose quadratic complexity with respect to image resolution limits their scalability. Mamba offers a promising alternative due to its linear complexity. However, previous Mamba-based methods have two main limitations. First, the inherent causal formulation of Mamba constrains the bidirectional spatial dependency modeling required by non-causal vision tasks. Second, existing Mamba-based vision models often overlook the unconstrained high entropy in the spatial token responses, which can weaken local details and high-frequency cues. To address these limitations, we propose MambaCount, an efficient framework built on the Spatial Sparse State Space Duality (S^4D) block. Specifically, we analyze and reconstruct the decay dynamics of hidden states in Mamba to alleviate the dependency constraints introduced by causal modeling. Moreover, we introduce a Spatial Token Selection (STS) sub-block to reduce the unconstrained high entropy in spatial token responses within Mamba. In addition, we design Multi-Granularity Prototypes (MGP) to identify object-like regions at different semantic levels, improving cross-modal alignment and interpretability. Extensive experiments on FSC-147 demonstrate that MambaCount achieves state-of-the-art performance among methods without secondary querying, obtaining a test MAE of 12.23, while retaining linear complexity.
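The causal constraint mentioned in the abstract can be illustrated with a toy example. The sketch below is not the S^4D block or any part of MambaCount; it is a minimal scalar recurrence of the form h[t] = decay[t] * h[t-1] + x[t], which is an assumed simplification of a state space scan. The names `causal_scan`, `tokens`, and `decay` are hypothetical. It shows two properties: the scan runs in a single pass over the sequence (linear in its length), and each output depends only on the current and earlier tokens, so a flattened image token never sees the spatial neighbours that come later in scan order.

```python
def causal_scan(x, decay):
    """Toy scalar recurrence h[t] = decay[t] * h[t-1] + x[t].

    One pass over the sequence, so the cost grows linearly with its length.
    This is an illustrative simplification, not the method from the paper.
    """
    h = 0.0
    out = []
    for x_t, a_t in zip(x, decay):
        h = a_t * h + x_t  # the state only accumulates past and current inputs
        out.append(h)
    return out


# Hypothetical flattened sequence of four spatial tokens.
tokens = [1.0, 2.0, 3.0, 4.0]
decay = [0.5, 0.5, 0.5, 0.5]

base = causal_scan(tokens, decay)

# Changing the LAST token leaves all earlier outputs untouched:
# earlier positions receive no information from later ones.
perturbed_last = causal_scan(tokens[:-1] + [10.0], decay)

# Changing the FIRST token propagates to every later output.
perturbed_first = causal_scan([10.0] + tokens[1:], decay)

print(base)
print(perturbed_last[:3] == base[:3])
print(perturbed_first[3] != base[3])
```

The asymmetry between the two perturbations is the one-directional dependency that non-causal vision tasks have to work around, for example by modifying how the hidden state decays, as the abstract describes.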