This paper proposes LLaFS, the first attempt to leverage large language models (LLMs) in few-shot segmentation. In contrast to conventional few-shot segmentation methods, which rely only on the limited and biased information from the annotated support images, LLaFS leverages the vast prior knowledge acquired by LLMs as an effective supplement and directly uses the LLM to segment images in a few-shot manner. To enable the text-based LLM to handle image-related tasks, we carefully design an input instruction that allows the LLM to produce segmentation results represented as polygons, and propose a region-attribute table to simulate the human visual mechanism and provide multi-modal guidance. We also synthesize pseudo samples and use curriculum learning for pretraining to augment data and achieve better optimization. LLaFS achieves state-of-the-art results on multiple datasets, showing the potential of using LLMs for few-shot computer vision tasks. Code will be available at https://github.com/lanyunzhu99/LLaFS.
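To make the polygon-based output concrete, the following minimal sketch shows how a polygon emitted as text could be turned back into a binary segmentation mask. It is an illustration only, not the LLaFS implementation: the abstract does not specify the exact output format, so the sketch assumes the LLM returns a list of integer `(x, y)` vertex pairs, and the names `parse_polygon`, `point_in_polygon`, and `polygon_to_mask` are hypothetical.

```python
import re


def parse_polygon(text):
    """Extract integer (x, y) vertex pairs from text such as '[(1,1),(5,1)]'.

    The textual format is an assumption made for this sketch.
    """
    pairs = re.findall(r"\((\d+)\s*,\s*(\d+)\)", text)
    return [(int(x), int(y)) for x, y in pairs]


def point_in_polygon(px, py, vertices):
    """Even-odd ray casting: count edge crossings of a ray going in +x."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line y = py can be crossed.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def polygon_to_mask(vertices, height, width):
    """Rasterize the polygon by testing the center of every pixel."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, vertices) else 0
         for col in range(width)]
        for row in range(height)
    ]


# Hypothetical LLM response describing a rectangular region.
llm_output = "[(1,1),(5,1),(5,4),(1,4)]"
vertices = parse_polygon(llm_output)
mask = polygon_to_mask(vertices, height=6, width=6)

for row in mask:
    print("".join(str(v) for v in row))
```

Representing a mask as a short vertex list keeps the target sequence compact enough for a text-only model to generate, at the cost of approximating fine boundaries; the pixel-center test used here is one simple rasterization choice among several.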