Object detection is a fundamental challenge in computer vision, centered on locating and classifying objects within images, with diverse applications in areas such as image analysis, robotics, and autonomous vehicles. Although existing methods have achieved great success, they are often constrained by a fixed vocabulary of objects. To overcome this limitation, approaches like MDETR have redefined object detection by incorporating region-level vision-language pre-training, yielding open-vocabulary object detectors. However, these methods are computationally heavy because they train large models for both vision and language representations simultaneously. To address this, we introduce a lightweight framework that significantly reduces the number of trainable parameters while preserving, or even improving, performance. We apply our solution to MDETR, resulting in Lightweight MDETR (LightMDETR), an optimized version of MDETR designed to improve computational efficiency without sacrificing accuracy. The core of our approach is to freeze the MDETR backbone and train only the Universal Projection (UP) module, which bridges vision and language representations. A learnable modality token parameter allows the UP to switch seamlessly between modalities. Evaluations on phrase grounding, referring expression comprehension, and segmentation show that LightMDETR not only reduces computational cost but also outperforms several state-of-the-art methods in accuracy.
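The core mechanism described above (a frozen backbone plus a single trainable projection conditioned on a learnable modality token) can be sketched as follows. This is a minimal, illustrative PyTorch sketch under our own assumptions; the class and parameter names are hypothetical and not taken from the paper's actual code.

```python
import torch
import torch.nn as nn

class UniversalProjection(nn.Module):
    """Hypothetical sketch of a UP-style module: one shared projection
    whose behavior is conditioned on a learnable per-modality token."""

    def __init__(self, dim: int, num_modalities: int = 2):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                    # shared across modalities
        self.modality_token = nn.Embedding(num_modalities, dim)

    def forward(self, feats: torch.Tensor, modality: int) -> torch.Tensor:
        # Adding the modality token lets the shared projection "switch"
        # between, e.g., vision (0) and language (1) features.
        tok = self.modality_token(torch.tensor(modality, device=feats.device))
        return self.proj(feats + tok)

# Stand-in backbone: frozen, so only the UP parameters are trained.
backbone = nn.Linear(256, 256)
for p in backbone.parameters():
    p.requires_grad = False

up = UniversalProjection(dim=256)
vision_out = up(backbone(torch.randn(4, 256)), modality=0)
text_out = up(torch.randn(7, 256), modality=1)
```

In this sketch the parameter savings come from the fact that only `up` (a linear layer plus two token embeddings) receives gradients, while the backbone stays fixed.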