The success of large language models (LLMs) has fostered a new research trend of multi-modality large language models (MLLMs), which is changing the paradigm of various fields in computer vision. Though MLLMs have shown promising results in numerous high-level vision and vision-language tasks such as VQA and text-to-image generation, no work has demonstrated how low-level vision tasks can benefit from MLLMs. We find that most current MLLMs are blind to low-level features because of the design of their vision modules, and are thus inherently incapable of solving low-level vision tasks. In this work, we propose $\textbf{LM4LV}$, a framework that enables a FROZEN LLM to solve a range of low-level vision tasks without any multi-modal data or prior. This showcases the LLM's strong potential in low-level vision and bridges the gap between MLLMs and low-level vision tasks. We hope this work can inspire new perspectives on LLMs and a deeper understanding of their mechanisms.