Recent advances in Large Multi-modal Models (LMMs) have demonstrated remarkable success as general-purpose multi-modal assistants, with a particular focus on holistic image- and video-language understanding. In contrast, less attention has been given to scaling fine-grained pixel-level understanding, where models are expected to achieve pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding. Specifically, UniPixel processes visual prompts, generates relevant masks on demand, and performs subsequent reasoning conditioned on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks spanning a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images and videos. We also design a novel PixelQA task that jointly requires referring, segmentation, and question answering to verify the flexibility of our method.