UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces

Reference-based object segmentation tasks, namely referring image segmentation (RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and video object segmentation (VOS), aim to segment a specific object by using either language or annotated masks as references. Despite significant progress in each field, current methods are designed for a single task and have developed in different directions, which prevents them from offering multi-task capabilities. In this work, we address this fragmentation and propose UniRef++ to unify the four reference-based object segmentation tasks with a single architecture. At the heart of our approach is the proposed UniFusion module, which performs multiway fusion to handle different tasks according to their specified references. A unified Transformer architecture is then adopted to achieve instance-level segmentation. With these unified designs, UniRef++ can be jointly trained on a broad range of benchmarks and can flexibly complete multiple tasks at run-time by specifying the corresponding references. We evaluate our unified models on various benchmarks. Extensive experimental results indicate that UniRef++ achieves state-of-the-art performance on RIS and RVOS, and performs competitively on FSS and VOS with a parameter-shared network. Moreover, we show that the proposed UniFusion module can be easily incorporated into the current advanced foundation model SAM and obtains satisfactory results with parameter-efficient finetuning. Code and models are available at \url{https://github.com/FoundationVision/UniRef}.
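
To make the task taxonomy concrete, the following minimal sketch shows how the four tasks can be told apart by two factors named above: whether the input is an image or a video, and whether the reference is language or an annotated mask. It is an illustration only and is not taken from the UniRef++ codebase; the names `Reference` and `infer_task` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Reference:
    """Hypothetical container for the reference that selects the target object."""
    text: Optional[str] = None       # language expression (used by RIS and RVOS)
    mask: Optional[Sequence] = None  # annotated mask (used by FSS and VOS)


def infer_task(is_video: bool, ref: Reference) -> str:
    """Map (input type, reference type) to one of the four task names."""
    has_text = ref.text is not None
    has_mask = ref.mask is not None
    # This sketch assumes exactly one kind of reference is given per query.
    if has_text == has_mask:
        raise ValueError("provide exactly one of text or mask")
    if has_text:
        return "RVOS" if is_video else "RIS"
    return "VOS" if is_video else "FSS"


if __name__ == "__main__":
    print(infer_task(False, Reference(text="the dog on the left")))
    print(infer_task(True, Reference(mask=[[0, 1], [1, 0]])))
```

In this sketch the task is inferred purely from what the caller supplies, which mirrors the idea of completing different tasks at run-time by specifying the corresponding reference. How a unified model actually consumes these references is not specified here.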
