Our primary goal is to create a good generalist perception model that can tackle multiple tasks within limits on computational resources and training data. To this end, we leverage text-to-image diffusion models pre-trained on billions of images. Our exhaustive evaluations demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. In particular, we achieve results on par with SAM-vit-h using only 0.06% of its data (i.e., 600K vs. 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding, and we show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation enables us to fully leverage pre-trained text-to-image models; as a result, DICEPTION can be trained at a cost orders of magnitude lower than that of conventional models trained from scratch. When adapting our model to other tasks, fine-tuning on as few as 50 images and as little as 1% of its parameters suffices. DICEPTION provides valuable insights and a more promising solution for visual generalist models. Homepage: https://aim-uofa.github.io/Diception, Huggingface Demo: https://huggingface.co/spaces/Canyu/Diception-Demo.
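To make the random-color encoding concrete, below is a minimal sketch of how instance masks can be rendered as a single RGB target image by assigning each instance a random color. This is an illustrative implementation under our own assumptions (boolean mask arrays, a hypothetical `encode_masks_as_colors` helper); the exact encoding used by DICEPTION may differ in detail.

```python
import numpy as np

def encode_masks_as_colors(masks, seed=None):
    """Encode binary instance masks into one RGB image by giving each
    instance a random color (a sketch of the random-color strategy;
    not necessarily DICEPTION's exact implementation).

    masks: (N, H, W) boolean array, one channel per instance.
    Returns: (H, W, 3) uint8 RGB image usable as a generation target.
    """
    rng = np.random.default_rng(seed)
    _, h, w = masks.shape
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for mask in masks:
        # Draw a random RGB color and paint this instance's pixels.
        color = rng.integers(0, 256, size=3, dtype=np.uint8)
        canvas[mask] = color
    return canvas

# Example: two toy instances on a 4x4 image.
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True   # instance 1: top-left block
masks[1, 2:, 2:] = True   # instance 2: bottom-right block
rgb = encode_masks_as_colors(masks, seed=0)
```

Because the colors carry no fixed class semantics, the same encoding serves both entity segmentation and semantic segmentation targets, which is what allows the tasks to be phrased uniformly as conditional image generation.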