While MLLMs have demonstrated adequate image understanding capabilities, they still struggle with pixel-level comprehension, limiting their practical applications. Current evaluation tasks like VQA and visual grounding remain too coarse to assess fine-grained pixel comprehension accurately. Though segmentation is foundational for pixel-level understanding, existing methods often require MLLMs to generate implicit tokens that are decoded through external pixel decoders. This approach disrupts the MLLM's text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model's intrinsic pixel-level understanding. We therefore introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm in which MLLMs mimic human annotators using interactive segmentation tools. By modeling segmentation as a multi-step Markov Decision Process, HLMAT enables MLLMs to iteratively generate text-based click points, producing high-quality masks without architectural changes or implicit tokens. Through this setup, we develop SegAgent, a model fine-tuned on human-like annotation trajectories, which achieves performance comparable to state-of-the-art (SOTA) methods and supports additional tasks such as mask refinement and annotation filtering. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task that facilitates exploration of MLLMs' visual reasoning abilities. Our adaptations of the policy improvement method StaR and of PRM-guided tree search further enhance the model's robustness in complex segmentation tasks, laying a foundation for future advances in fine-grained visual perception and multi-step decision-making for MLLMs.
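To make the multi-step MDP formulation concrete, the following is a minimal Python sketch of the annotation loop it describes: the state carries the referring expression, the clicks so far, and the current mask; the action is a text-based click emitted by the model; the transition is the interactive segmentation tool decoding the accumulated clicks into a new mask. All names here (AnnotationState, mllm_policy, segmentation_tool, the <click> action format) are illustrative assumptions rather than the paper's actual interface, and the policy and tool are replaced with stubs.

```python
import re
from dataclasses import dataclass, field

@dataclass
class AnnotationState:
    """State of the annotation MDP: the expression to segment, the
    accumulated clicks, and the current mask (a placeholder here)."""
    expression: str
    clicks: list = field(default_factory=list)  # [(x, y, is_positive), ...]
    mask: object = None                         # produced by the tool

def mllm_policy(state: AnnotationState) -> str:
    """Hypothetical stand-in for the fine-tuned MLLM. Given the image,
    the expression, and the current mask rendered into the prompt, it
    emits the next action as plain text."""
    return "<click>(214, 87) positive</click>"

def segmentation_tool(clicks) -> object:
    """Hypothetical interactive segmentation tool (e.g. a SAM-style
    model) that decodes the accumulated clicks into a binary mask."""
    return {"clicks": list(clicks)}  # placeholder mask object

def parse_action(text: str):
    """Parse the text-based click action back into coordinates and
    polarity; unparsable output is treated as a stop action."""
    m = re.search(r"<click>\((\d+),\s*(\d+)\)\s*(positive|negative)</click>", text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3) == "positive"

def annotate(expression: str, max_steps: int = 5) -> AnnotationState:
    """Roll out the MDP: each step the policy reads the state and emits
    a text click, the tool updates the mask, and the loop stops at the
    step budget or when the policy stops producing valid clicks."""
    state = AnnotationState(expression=expression)
    for _ in range(max_steps):
        action = parse_action(mllm_policy(state))
        if action is None:
            break
        state.clicks.append(action)
        state.mask = segmentation_tool(state.clicks)
    return state

print(annotate("the dog on the left").clicks)
```

Because every action is plain text in this loop, the same interface also supports the mask-refinement and annotation-filtering uses mentioned above, and trajectory-level methods such as StaR-style fine-tuning or PRM-guided tree search can operate directly on sequences of these text actions.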