Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, in addition to global embeddings. It supports a wide range of visual understanding tasks such as visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination ability.
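To make the scoring idea concrete, below is a minimal sketch of how a per-object matching score could combine semantic similarity with predicted localization quality. The function name, tensor shapes, and the use of a simple product are illustrative assumptions; the abstract only states that the two signals are combined.

```python
import torch
import torch.nn.functional as F

def object_matching_scores(object_embs: torch.Tensor,
                           iou_preds: torch.Tensor,
                           text_emb: torch.Tensor) -> torch.Tensor:
    """Score each candidate region against a query phrase embedding.

    object_embs: (N, D) per-object semantic embeddings
    iou_preds:   (N,)   predicted localization quality (IoU) for each region
    text_emb:    (D,)   embedding of the query phrase
    Returns a (N,) tensor of matching scores.
    """
    # Semantic similarity between each object embedding and the phrase.
    sem_sim = F.cosine_similarity(object_embs, text_emb.unsqueeze(0), dim=-1)
    # Combine semantic similarity with the predicted IoU. A simple product
    # is assumed here for illustration; the actual combination rule is not
    # specified in the abstract.
    return sem_sim * iou_preds
```

A retrieval step would then rank regions (or images) by these scores, e.g. `scores.argmax()` for grounding a single phrase.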