The development of large language models (LLMs) has greatly advanced multimodal understanding, leading to the emergence of large multimodal models (LMMs). To enhance visual comprehension, recent studies have equipped LMMs with region-level understanding by representing object bounding box coordinates as text sequences (pixel2seq). In this paper, we introduce a novel paradigm for object location modeling, called pixel2emb, in which the LMM outputs location embeddings that are then decoded by different decoders. This paradigm allows different location formats (such as bounding boxes and masks) to be used in multimodal conversations. Furthermore, this embedding-based location modeling makes it possible to reuse existing practices from localization tasks such as detection and segmentation. In resource-limited scenarios, pixel2emb outperforms existing state-of-the-art (SOTA) approaches on both location input and location output tasks under fair comparison. Leveraging the proposed pixel2emb method, we train an LMM named NExT-Chat and demonstrate its ability to handle multiple tasks, including visual grounding, region captioning, and grounded reasoning.
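The contrast between the two paradigms can be illustrated with a toy sketch. This is not the NExT-Chat implementation: the abstract does not specify the decoder architecture, so the function names, the quantization into 1000 bins, the embedding size, and the linear-plus-sigmoid box head below are all assumptions chosen for illustration. The random vector stands in for a location embedding that an LMM would emit.

```python
import math
import random


def box_to_text(box, bins=1000):
    """pixel2seq-style: quantize a normalized (x1, y1, x2, y2) box into text."""
    return "[" + ",".join(str(round(v * (bins - 1))) for v in box) + "]"


def text_to_box(text, bins=1000):
    """Parse the text form back into normalized coordinates."""
    return [int(t) / (bins - 1) for t in text.strip("[]").split(",")]


def decode_box(embedding, weights, bias):
    """pixel2emb-style: map one location embedding to a normalized box.

    Hypothetical stand-in for a learned box decoder (linear layer + sigmoid).
    A separate mask decoder could consume the same embedding.
    """
    out = []
    for row, b in zip(weights, bias):
        z = sum(w * e for w, e in zip(row, embedding)) + b
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out


random.seed(0)
dim = 8  # assumed embedding size, far smaller than a real LMM hidden state

# Placeholder for the location embedding produced by the LMM.
embedding = [random.gauss(0, 1) for _ in range(dim)]
# Placeholder decoder parameters; in practice these would be trained.
weights = [[random.gauss(0, 0.5) for _ in range(dim)] for _ in range(4)]
bias = [0.0] * 4

original_box = [0.1, 0.2, 0.5, 0.9]
text = box_to_text(original_box)      # location expressed as text tokens
box_from_text = text_to_box(text)     # recovered up to quantization error
box_from_emb = decode_box(embedding, weights, bias)  # location from an embedding
```

In the text route the location format is fixed by what can be written as coordinates, whereas in the embedding route the same vector can in principle be passed to different decoders, which is the flexibility the abstract attributes to pixel2emb.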