Unifying Image Processing as Visual Prompting Question Answering

Image processing is a fundamental task in computer vision that aims to enhance image quality and extract essential features for downstream vision applications. Traditionally, a task-specific model is developed for each individual task, and designing such models requires distinct expertise. Building on the success of large language models (LLMs) in natural language processing (NLP), computer vision has seen a similar trend toward large-scale models developed through pretraining and in-context learning. This paradigm shift reduces the reliance on task-specific models and yields a single, powerful unified model that handles a variety of tasks. However, these advances have concentrated predominantly on high-level vision tasks, with less attention paid to low-level vision tasks. To address this gap, we propose PromptGIP, a universal model for general image processing that covers image restoration, image enhancement, image feature extraction, and related tasks. Inspired by NLP question answering (QA) techniques, we employ a visual prompting question answering paradigm. Specifically, we treat each input-output image pair as a structured question-answer sentence, thereby recasting the image processing task as a prompting QA problem. Given suitable visual prompts, PromptGIP can undertake diverse cross-domain tasks without task-specific finetuning. Our methodology offers a universal and adaptive solution to general image processing. Although PromptGIP demonstrates a certain degree of out-of-domain task generalization, further research is needed to fully explore its potential for stronger emergent generalization.
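
As a rough illustration of the question-answer framing described above, the sketch below arranges prompt image pairs and a query image into a single "QA sentence" whose final answer slot is left blank for a model to fill in. The abstract does not specify how PromptGIP encodes or masks its inputs, so the blank-answer convention, the nested-list image type, and the names `build_qa_sequence` and `MASK_VALUE` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical image type: a grayscale image as rows of pixel values.
Image = List[List[float]]

# Placeholder fill for the unknown answer slot (an assumption, not from the paper).
MASK_VALUE = 0.0


def build_qa_sequence(
    prompt_pairs: List[Tuple[Image, Image]],
    query: Image,
) -> Tuple[List[Image], List[bool]]:
    """Arrange images as Q, A, ..., Q, [blank A].

    Returns the image sequence and a parallel list of flags marking
    which slot a model would be expected to predict.
    """
    if not prompt_pairs:
        raise ValueError("at least one prompt pair is needed to specify the task")

    height, width = len(query), len(query[0])
    sequence: List[Image] = []
    to_predict: List[bool] = []

    # Each prompt pair is a worked example: degraded input, then desired output.
    for question, answer in prompt_pairs:
        sequence.extend([question, answer])
        to_predict.extend([False, False])

    # The query is the final "question"; its answer slot is left blank.
    blank = [[MASK_VALUE] * width for _ in range(height)]
    sequence.extend([query, blank])
    to_predict.extend([False, True])
    return sequence, to_predict


# Usage: one denoising-style prompt pair, then a new query image.
noisy = [[0.9, 0.1], [0.2, 0.8]]
clean = [[1.0, 0.0], [0.0, 1.0]]
query = [[0.7, 0.3], [0.4, 0.6]]

sequence, to_predict = build_qa_sequence([(noisy, clean)], query)
print(len(sequence), to_predict)
```

In this framing the task is conveyed only by the prompt pair, so swapping in a different pair (for example, an image and its edge map) would change the requested operation without changing the code or retraining anything.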
