Image processing is a fundamental task in computer vision that aims to enhance image quality and extract essential features for subsequent vision applications. Traditionally, a task-specific model is developed for each individual task, and designing such models requires distinct expertise. Building upon the success of large language models (LLMs) in natural language processing (NLP), computer vision has seen a similar trend toward large-scale models developed through pretraining and in-context learning. This paradigm shift reduces the reliance on task-specific models, yielding a single powerful model that handles various tasks. However, these advances have concentrated predominantly on high-level vision tasks, with less attention paid to low-level vision tasks. To address this issue, we propose a universal model for general image processing that covers image restoration, image enhancement, image feature extraction, \textit{etc.} The proposed framework, named PromptGIP, unifies these diverse image processing tasks. Inspired by NLP question answering (QA) techniques, we employ a visual prompting question answering paradigm. Specifically, we treat the input-output image pair as a structured question-answer sentence, thereby reformulating the image processing task as a prompting QA problem. PromptGIP can undertake diverse \textbf{cross-domain} tasks using provided visual prompts, eliminating the need for task-specific finetuning. Our methodology offers a universal and adaptive solution to general image processing. Although PromptGIP demonstrates a certain degree of out-of-domain task generalization, further research is needed to fully explore its potential for stronger emergent generalization.
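To make the question-answer framing concrete, the following is a minimal sketch of how input-output image pairs might be arranged as a prompt sequence. It is an illustration under stated assumptions, not the PromptGIP implementation: the function name `build_qa_prompt`, the constant-filled placeholder for the unknown answer, and the flat stacking order are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def build_qa_prompt(prompt_pairs, query, mask_value=0.0):
    """Arrange (question, answer) image pairs plus a query into one sequence.

    prompt_pairs: list of (input_image, output_image) arrays, each H x W x C.
    query: a new input image whose answer is to be predicted.
    Returns the stacked sequence and a boolean mask marking the slot to fill.
    NOTE: hypothetical helper; layout and masking are assumptions.
    """
    images = []
    for question, answer in prompt_pairs:
        if question.shape != query.shape or answer.shape != query.shape:
            raise ValueError("all images must share one shape")
        images.extend([question, answer])
    images.append(query)
    # Placeholder for the unknown answer, to be filled in by a model.
    images.append(np.full_like(query, mask_value))
    sequence = np.stack(images)  # shape: (2 * (K + 1), H, W, C)
    to_predict = np.zeros(len(images), dtype=bool)
    to_predict[-1] = True
    return sequence, to_predict


# Usage: one denoising example as the visual prompt, then a new query image.
rng = np.random.default_rng(0)
clean = rng.random((8, 8, 3))
noisy = np.clip(clean + 0.1 * rng.standard_normal(clean.shape), 0.0, 1.0)
query = rng.random((8, 8, 3))
seq, mask = build_qa_prompt([(noisy, clean)], query)
```

Under this framing, switching tasks only means swapping the prompt pair (for example, a low-light image and its enhanced version) while the model weights stay fixed, which is the sense in which no task-specific finetuning is needed.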