Image Segmentation in Foundation Model Era: A Survey

Image segmentation is a long-standing challenge in computer vision, studied continuously over several decades, as evidenced by seminal algorithms such as N-Cut, FCN, and MaskFormer. With the advent of foundation models (FMs), contemporary segmentation methods have entered a new era, either by adapting FMs (e.g., CLIP, Stable Diffusion, DINO) for image segmentation or by developing dedicated segmentation foundation models (e.g., SAM). These approaches not only deliver superior segmentation performance, but also exhibit new segmentation capabilities not previously seen in the deep learning context. However, current research in image segmentation lacks a detailed analysis of the distinct characteristics, challenges, and solutions associated with these advancements. This survey seeks to fill this gap by providing a thorough review of cutting-edge research centered on FM-driven image segmentation. We investigate two basic lines of research -- generic image segmentation (i.e., semantic segmentation, instance segmentation, panoptic segmentation) and promptable image segmentation (i.e., interactive segmentation, referring segmentation, few-shot segmentation) -- by delineating their respective task settings, background concepts, and key challenges. Furthermore, we provide insights into the emergence of segmentation knowledge from FMs such as CLIP, Stable Diffusion, and DINO. We provide an overview of more than 300 segmentation approaches to capture the breadth of current research efforts. We then discuss open issues and potential avenues for future research. We envisage that this fresh, comprehensive, and systematic survey will catalyze the evolution of advanced image segmentation systems. A public website has been created to continuously track developments in this fast-advancing field: \url{https://github.com/stanley-313/ImageSegFM-Survey}.
