Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision--language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.