Few-shot image classification and segmentation (FS-CS) requires classifying and segmenting target objects in a query image given only a few examples of the target classes. We introduce the Vision-Instructed Segmentation and Evaluation (VISE) method, which recasts FS-CS as a Visual Question Answering (VQA) problem and solves it with Vision-Language Models (VLMs) in a training-free manner. By enabling a VLM to use off-the-shelf vision models as tools, the proposed method classifies and segments target objects using only image-level labels. Specifically, chain-of-thought prompting and in-context learning guide the VLM to answer multiple-choice questions like a human, while vision models such as YOLO and the Segment Anything Model (SAM) assist the VLM in completing the task. The modular framework of the proposed method makes it easy to extend. Our approach achieves state-of-the-art performance on the Pascal-5i and COCO-20i datasets.
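To make the multiple-choice formulation concrete, the following is a minimal sketch of how the classification part of an FS-CS episode might be posed as a VQA question and how the reply might be decoded. It is not the VISE implementation: the function names `build_question` and `parse_answer`, the prompt wording, the "none of the above" option, and the `Answer:` reply convention are all assumptions made for illustration, and the VLM call itself is left out.

```python
from string import ascii_uppercase


def build_question(class_names):
    """Format candidate classes as lettered options (hypothetical prompt wording)."""
    # Assumption: an extra option lets the model say that no target class is present.
    options = list(class_names) + ["none of the above"]
    lines = [
        "Which of the following objects appear in the image? "
        "Think step by step, then finish with 'Answer: <letters>'."
    ]
    for letter, name in zip(ascii_uppercase, options):
        lines.append(f"{letter}. {name}")
    return "\n".join(lines)


def parse_answer(reply, class_names):
    """Map a reply ending in e.g. 'Answer: A, C' back to class names."""
    # Keep only the text after the last 'Answer:' so the reasoning is ignored.
    answer = reply.rsplit("Answer:", 1)[-1]
    letters = {token.strip(" .") for token in answer.split(",")}
    # Only letters assigned to real classes are mapped; the 'none' option yields [].
    return [
        name
        for letter, name in zip(ascii_uppercase, class_names)
        if letter in letters
    ]


if __name__ == "__main__":
    classes = ["dog", "cat", "bicycle"]
    print(build_question(classes))
    print(parse_answer("A dog stands next to a bicycle. Answer: A, C", classes))
```

In a pipeline of the kind the abstract describes, the class names returned by `parse_answer` would then presumably be handed to a detector and a segmenter (for example YOLO and SAM) to produce masks; that step is omitted here because the abstract does not specify how the tools are invoked.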