Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only for supporting reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify one or more regions as input. We train these models by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description, both automatically generated by a set of VL models. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method yields VL models that reason more precisely than a baseline that passes a generated referring expression to an LLM.
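The data-generation step described above (prompt an LLM with a global description and local region descriptions, then filter with a critic) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Region`, `build_prompt`, `filter_with_critic`, `toy_llm`, and `toy_critic`, the prompt wording, and the 0.5 threshold are all hypothetical, and the LLM and critic are replaced by toy stand-ins so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    region_id: int
    description: str  # literal description of the region, e.g. from a captioning model


def build_prompt(global_description: str, regions: List[Region]) -> str:
    """Combine the global and per-region literal descriptions into one LLM prompt."""
    lines = [f"Image: {global_description}"]
    for region in regions:
        lines.append(f"Region [{region.region_id}]: {region.description}")
    # Hypothetical instruction text; the actual prompt is not given in the abstract.
    lines.append("Write one commonsense inference that refers to regions by their IDs.")
    return "\n".join(lines)


def filter_with_critic(
    candidates: List[str],
    critic: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, chosen only for illustration
) -> List[str]:
    """Keep only the candidates that the critic scores at or above the threshold."""
    return [text for text in candidates if critic(text) >= threshold]


# Toy stand-ins for the real LLM and the trained critic model (both hypothetical).
def toy_llm(prompt: str) -> List[str]:
    return ["[0] is probably waiting for [1] to arrive.", "Something is happening."]


def toy_critic(text: str) -> float:
    # Toy heuristic: favor statements that mention a region ID.
    return 1.0 if "[" in text else 0.0


regions = [Region(0, "a man holding an umbrella"), Region(1, "a bus at a stop")]
prompt = build_prompt("a rainy street with people and a bus", regions)
corpus = filter_with_critic(toy_llm(prompt), toy_critic)
```

In this sketch, `corpus` holds only the candidate that refers to regions by ID. In the setting the abstract describes, the filtered corpus would then serve as training data for the VL model.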