Aligning and Prompting Everything All at Once for Universal Visual Perception

Vision foundation models have recently been explored as a basis for general-purpose vision systems. However, the predominant paradigms cast instance-level tasks as object-word alignment, which brings heavy cross-modality interaction and is not effective for prompting object detection and visual grounding. Another line of work, focused on pixel-level tasks, often encounters a large annotation gap between things and stuff and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to these prevailing methods, we present APE, a universal visual perception model that aligns and prompts everything in an image all at once to perform diverse tasks, i.e., detection, segmentation, and grounding, under an instance-level sentence-object matching paradigm. Specifically, APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection, which efficiently scales up model prompting to thousands of category vocabularies and region descriptions while maintaining the effectiveness of cross-modality fusion. To bridge the granularity gap between different pixel-level tasks, APE unifies semantic and panoptic segmentation as proxy instance learning by treating any isolated region as an individual instance. APE aligns vision and language representations on broad data with natural and challenging characteristics all at once, without task-specific fine-tuning. Extensive experiments on over 160 datasets demonstrate that, with only one set of weights, APE outperforms (or is on par with) state-of-the-art models, showing that effective yet universal perception for aligning and prompting anything is indeed feasible. Code and trained models are released at https://github.com/shenyunhang/APE.
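
To make the proxy-instance idea concrete, the following is a minimal sketch, not the released APE implementation, of how a semantic mask could be split so that every isolated region becomes its own instance. The function name `proxy_instances`, the `connectivity` parameter, and the flood-fill procedure are assumptions made for illustration; the abstract does not specify how isolated regions are extracted.

```python
from collections import deque


def proxy_instances(sem_mask, connectivity=4):
    """Split a semantic mask into connected regions ("proxy instances").

    sem_mask: list of rows, each a list of integer class ids.
    Returns a list of (class_id, set_of_(row, col)) in raster-scan order.
    Hypothetical helper: the connectivity rule is an assumption.
    """
    if connectivity == 4:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif connectivity == 8:
        steps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if (dy, dx) != (0, 0)]
    else:
        raise ValueError("connectivity must be 4 or 8")

    height = len(sem_mask)
    width = len(sem_mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    instances = []

    for r in range(height):
        for c in range(width):
            if seen[r][c]:
                continue
            cls = sem_mask[r][c]
            region = set()
            queue = deque([(r, c)])
            seen[r][c] = True
            # Breadth-first flood fill over same-class neighbours.
            while queue:
                y, x = queue.popleft()
                region.add((y, x))
                for dy, dx in steps:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and sem_mask[ny][nx] == cls):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            instances.append((cls, region))
    return instances


if __name__ == "__main__":
    # Class 1 appears in two disconnected regions, so it yields two instances.
    mask = [
        [1, 1, 0, 2],
        [1, 0, 0, 2],
        [0, 0, 1, 1],
    ]
    for cls, pixels in proxy_instances(mask):
        print(cls, len(pixels))
    # prints: 1 3 / 0 5 / 2 2 / 1 2 (one pair per line)
```

In this toy mask, the single semantic class 1 is turned into two separate instances because its two regions do not touch, which is the sense in which things and stuff can be handled by the same instance-level objective.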
