Object detection is a fundamental challenge in computer vision, centered on recognizing objects within images, with diverse applications in areas like image analysis, robotics, and autonomous vehicles. Although existing methods have achieved great success, they are often constrained by a fixed vocabulary of objects. To overcome this limitation, approaches like MDETR have redefined object detection by incorporating region-level vision-language pre-training, enabling open-vocabulary object detectors. However, these methods are computationally heavy due to the simultaneous training of large models for both vision and language representations. To address this, we introduce a lightweight framework that significantly reduces the number of parameters while preserving, or even improving, performance. Our solution is applied to MDETR, resulting in the development of Lightweight MDETR (LightMDETR), an optimized version of MDETR designed to enhance computational efficiency without sacrificing accuracy. The core of our approach involves freezing the MDETR backbone and training only the Universal Projection module (UP), which bridges vision and language representations. A learnable modality token parameter allows the UP to seamlessly switch between modalities. Evaluations on tasks like phrase grounding, referring expression comprehension, and segmentation show that LightMDETR not only reduces computational costs but also outperforms several state-of-the-art methods in terms of accuracy.
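The core idea above — a single shared projection reused for both vision and language features, conditioned by a learnable modality token — can be illustrated with a minimal NumPy sketch. All names, dimensions, and the additive conditioning scheme are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sketch of the Universal Projection (UP) idea: one shared
# projection for both modalities, with a learnable "modality token" that
# signals which modality is being processed. Details are assumed.

D_MODEL = 8  # shared embedding dimension (assumed)


class UniversalProjection:
    def __init__(self, d_model, rng):
        # a single weight matrix shared by both modalities
        self.W = rng.standard_normal((d_model, d_model)) * 0.1
        # learnable modality tokens: row 0 = vision, row 1 = language
        self.modality_tokens = rng.standard_normal((2, d_model)) * 0.1

    def __call__(self, x, modality):
        # add the modality token, then apply the shared projection
        tok = self.modality_tokens[0 if modality == "vision" else 1]
        return (x + tok) @ self.W


rng = np.random.default_rng(0)
up = UniversalProjection(D_MODEL, rng)

vision_feats = rng.standard_normal((4, D_MODEL))  # e.g. 4 region features
text_feats = rng.standard_normal((6, D_MODEL))    # e.g. 6 token features

v_out = up(vision_feats, "vision")
t_out = up(text_feats, "language")
print(v_out.shape, t_out.shape)  # both land in the same shared space
```

Because only `W` and `modality_tokens` would be trained while the backbone stays frozen, the trainable parameter count is a small fraction of the full model's.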