Unifying Image Processing as Visual Prompting Question Answering

Image processing is a fundamental task in computer vision that aims to enhance image quality and extract essential features for subsequent vision applications. Traditionally, a task-specific model is developed for each task, and designing such models requires distinct expertise. Following the success of large language models (LLMs) in natural language processing (NLP), computer vision has seen a similar trend toward developing large-scale models through pretraining and in-context learning. This paradigm shift reduces the reliance on task-specific models, yielding a powerful unified model that handles various tasks. However, these advances have predominantly concentrated on high-level vision tasks, with less attention paid to low-level vision tasks. To address this issue, we propose a universal model for general image processing that covers tasks such as image restoration, image enhancement, and image feature extraction. Our proposed framework, named PromptGIP, unifies these diverse image processing tasks within a single framework. Inspired by NLP question answering (QA) techniques, we employ a visual prompting question answering paradigm. Specifically, we treat the input-output image pair as a structured question-answer sentence, thereby reprogramming the image processing task as a prompting QA problem. PromptGIP can undertake diverse \textbf{cross-domain} tasks using provided visual prompts, eliminating the need for task-specific finetuning. Our methodology offers a universal and adaptive solution to general image processing. While PromptGIP has demonstrated a certain degree of out-of-domain task generalization capability, further research is needed to fully explore its potential for more powerful emergent generalization.
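
The question-answer framing described in the abstract can be illustrated with a small sketch. This is not the PromptGIP implementation: the function name `build_prompt_sentence`, the four-slot layout (prompt question, prompt answer, query question, empty answer), and the zero-filled answer slot are all assumptions made for illustration. The abstract itself states only that input-output image pairs are treated as question-answer sentences.

```python
import numpy as np


def build_prompt_sentence(prompt_q, prompt_a, query_q):
    """Stack images as [Q1, A1, Q2, blank A2] along a new leading axis.

    Hypothetical helper: Q1/A1 is a visual prompt (an input-output example
    of some task), Q2 is the new input, and the blank slot is where a model
    would be asked to produce the answer.
    """
    images = [np.asarray(x, dtype=np.float32) for x in (prompt_q, prompt_a, query_q)]
    if len({im.shape for im in images}) != 1:
        raise ValueError("all images must share one shape")

    # Assumption: the unknown answer is represented by an all-zero image.
    blank_answer = np.zeros_like(images[0])
    sentence = np.stack(images + [blank_answer], axis=0)

    # True marks the slot that a model would have to predict.
    to_predict = np.array([False, False, False, True])
    return sentence, to_predict


# Toy usage: a denoising "task" is specified purely by the example pair.
rng = np.random.default_rng(0)
clean_1 = rng.random((8, 8, 3))
clean_2 = rng.random((8, 8, 3))
noisy_1 = np.clip(clean_1 + 0.1 * rng.standard_normal(clean_1.shape), 0.0, 1.0)
noisy_2 = np.clip(clean_2 + 0.1 * rng.standard_normal(clean_2.shape), 0.0, 1.0)

sentence, to_predict = build_prompt_sentence(noisy_1, clean_1, noisy_2)
print(sentence.shape, to_predict)
```

In this sketch, switching to a different task (for example, enhancement instead of denoising) changes only the example pair passed in, which mirrors the abstract's claim that the task is specified by the visual prompt rather than by task-specific finetuning.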
