Our primary goal is to create a good generalist perception model that can tackle multiple tasks within limits on computational resources and training data. To this end, we leverage text-to-image diffusion models pre-trained on billions of images. Our exhaustive evaluations demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. In particular, we achieve results on par with SAM-vit-h using only 0.06% of its data (i.e., 600K vs. 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding, and we show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation enables us to fully leverage pre-trained text-to-image models; as a result, DICEPTION can be trained at a cost orders of magnitude lower than that of conventional models trained from scratch. When adapting our model to other tasks, fine-tuning on as few as 50 images and as little as 1% of its parameters suffices. DICEPTION provides valuable insights and a more promising solution for visual generalist models. Homepage: https://aim-uofa.github.io/Diception, Huggingface Demo: https://huggingface.co/spaces/Canyu/Diception-Demo.
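To make the random-color encoding concrete, below is a minimal sketch of how instance masks can be rendered as a single RGB target image by assigning each instance a random color. This is an illustrative implementation under our own assumptions (boolean mask arrays, a hypothetical `encode_masks_as_colors` helper); the exact encoding used by DICEPTION may differ in detail.

```python
import numpy as np

def encode_masks_as_colors(masks, seed=None):
    """Encode binary instance masks into one RGB image by giving each
    instance a random color (a sketch of the random-color strategy;
    not necessarily DICEPTION's exact implementation).

    masks: (N, H, W) boolean array, one channel per instance.
    Returns: (H, W, 3) uint8 RGB image usable as a generation target.
    """
    rng = np.random.default_rng(seed)
    _, h, w = masks.shape
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for mask in masks:
        # Draw a random RGB color and paint this instance's pixels.
        color = rng.integers(0, 256, size=3, dtype=np.uint8)
        canvas[mask] = color
    return canvas

# Example: two toy instances on a 4x4 image.
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True   # instance 1: top-left block
masks[1, 2:, 2:] = True   # instance 2: bottom-right block
rgb = encode_masks_as_colors(masks, seed=0)
```

Because the colors carry no fixed class semantics, the same encoding serves both entity segmentation and semantic segmentation targets, which is what allows the tasks to be phrased uniformly as conditional image generation.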