Object detection in computer vision has traditionally been limited to identifying objects in images. Integrating textual descriptions enriches this process with linguistic context, improving both versatility and accuracy. The MDETR model advances the field by jointly modeling image and text for more versatile object detection and classification. However, MDETR's complexity and high computational cost hinder its practical use. In this paper, we introduce Lightweight MDETR (LightMDETR), an MDETR variant optimized for computational efficiency while preserving strong multimodal capability. Our approach freezes the pretrained MDETR backbone and trains a single added component, the Deep Fusion Encoder (DFE), to represent both the image and text modalities; a learnable context vector lets the DFE switch between them. Evaluation on the RefCOCO, RefCOCO+, and RefCOCOg referring-expression benchmarks demonstrates that LightMDETR achieves superior precision and accuracy.
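The core idea above (a frozen backbone feeding a single trainable encoder that is conditioned on the current modality by a learnable context vector) can be sketched as follows. This is a minimal NumPy illustration under assumed dimensions, not the actual LightMDETR implementation; all names (`DeepFusionEncoder`, `ctx`, the projection matrices) and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions; the real MDETR backbone sizes differ.
D_IMG, D_TXT, D_SHARED = 256, 768, 128


class DeepFusionEncoder:
    """Sketch of a single shared encoder for both modalities.

    The frozen backbone (not shown) produces image and text features;
    only this module's parameters would be updated during training.
    """

    def __init__(self):
        # Per-modality input projections into a shared space.
        self.w_img = rng.normal(0.0, 0.02, (D_IMG, D_SHARED))
        self.w_txt = rng.normal(0.0, 0.02, (D_TXT, D_SHARED))
        # Shared weights reused for both modalities.
        self.w_shared = rng.normal(0.0, 0.02, (D_SHARED, D_SHARED))
        # Learnable context vectors that tell the shared weights
        # which modality they are currently encoding.
        self.ctx = {
            "image": rng.normal(0.0, 0.02, D_SHARED),
            "text": rng.normal(0.0, 0.02, D_SHARED),
        }

    def encode(self, feats, modality):
        proj = feats @ (self.w_img if modality == "image" else self.w_txt)
        # Condition the shared encoder on the modality via its context vector.
        return np.tanh((proj + self.ctx[modality]) @ self.w_shared)


dfe = DeepFusionEncoder()
img_feat = rng.normal(size=(5, D_IMG))  # stand-in for frozen-backbone image tokens
txt_feat = rng.normal(size=(7, D_TXT))  # stand-in for frozen text-encoder tokens
img_emb = dfe.encode(img_feat, "image")
txt_emb = dfe.encode(txt_feat, "text")
```

Because `w_shared` is reused across modalities, both embeddings land in one joint space of width `D_SHARED`, while the context vectors carry the modality-specific conditioning; only the DFE parameters would receive gradients in training.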