Text-guided Open-vocabulary Object Counting (TOOC) counts arbitrary object categories specified by text prompts, offering substantially greater flexibility than conventional closed-set counting. However, existing TOOC methods are developed and evaluated primarily on clean images, whereas real-world scenes often suffer from adverse conditions such as rain, fog, darkness, and sensor noise, which severely degrade visual quality and impair vision-language alignment. To bridge this gap, we introduce Robust-TOOC, the first benchmark for evaluating TOOC under diverse corruption conditions. It covers six representative degradation types: rain, fog, darkness, Gaussian noise, salt-and-pepper noise, and mixed corruption. To improve robustness while preserving the original counting architecture, we propose Dual-TTT, a dual-architecture test-time training framework for TOOC. Specifically, during test-time training, Dual-TTT updates only a Text-guided Lightweight Denoising module (TL-Denoiser) and keeps the original counting network frozen. Inspired by diffusion models, the TL-Denoiser is optimized to remove corruption-aware noise from image representations under degraded conditions. Because only the TL-Denoiser is trained at test time, Dual-TTT is annotation-free and can be integrated into existing TOOC models without modifying their original architecture. Extensive experiments on multiple recent TOOC baselines demonstrate the effectiveness of Dual-TTT.
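The pixel-level degradation types named in the abstract (darkness, Gaussian noise, salt-and-pepper noise, and a mixture of them) can be illustrated with a small sketch. This is a minimal pure-Python illustration on a grayscale image stored as nested lists, not the procedure used to build Robust-TOOC: the function names, the parameters `factor`, `sigma`, and `amount`, their default values, and the composition order in `mixed_corruption` are all assumptions. Rain and fog are omitted because they require a rendering model that the abstract does not describe.

```python
import random


def _clip(value):
    """Round to the nearest integer and clamp to the 8-bit range [0, 255]."""
    return max(0, min(255, int(round(value))))


def darken(image, factor=0.3):
    """Simulate low light by scaling every pixel by `factor` (assumed value)."""
    return [[_clip(p * factor) for p in row] for row in image]


def add_gaussian_noise(image, sigma=25.0, rng=None):
    """Add zero-mean Gaussian noise with standard deviation `sigma` per pixel."""
    if rng is None:
        rng = random.Random()
    return [[_clip(p + rng.gauss(0.0, sigma)) for p in row] for row in image]


def add_salt_and_pepper(image, amount=0.05, rng=None):
    """Replace each pixel with 0 or 255 with probability `amount`."""
    if rng is None:
        rng = random.Random()
    out = []
    for row in image:
        new_row = []
        for p in row:
            if rng.random() < amount:
                # Salt (255) and pepper (0) are chosen with equal probability.
                new_row.append(255 if rng.random() < 0.5 else 0)
            else:
                new_row.append(p)
        out.append(new_row)
    return out


def mixed_corruption(image, rng=None):
    """Compose the corruptions above; the order here is a hypothetical choice."""
    if rng is None:
        rng = random.Random()
    dark = darken(image)
    noisy = add_gaussian_noise(dark, rng=rng)
    return add_salt_and_pepper(noisy, rng=rng)
```

For example, `mixed_corruption([[128] * 8 for _ in range(8)], random.Random(0))` returns an 8×8 image of the same shape whose values stay in the range 0 to 255. Passing a seeded `random.Random` keeps the corruption reproducible, which matters when different counting models are compared on the same degraded inputs.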