Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly let the user "point to" and access specific regions within an image. This capability is important not only for supporting reference-grounded VL benchmarks but also for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify one or more regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description, both automatically generated by a set of VL models. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus successfully distills existing VL models to support a reference-as-input interface. Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method yields VL models that reason more precisely than a baseline that passes a generated referring expression to an LLM.
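The data-generation loop described above (verbalize the image and its regions, prompt an LLM, keep only what a critic accepts) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Region`, `build_prompt`, and `distill_examples`, the prompt wording, the 0.5 threshold, and the toy `toy_generate` and `toy_critic` stand-ins are hypothetical and are not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    region_id: int
    description: str  # literal region description, e.g. produced by a VL captioner


def build_prompt(global_description: str, regions: List[Region]) -> str:
    """Assemble a text-only prompt from the global and per-region descriptions."""
    lines = [f"Image: {global_description}"]
    for region in regions:
        lines.append(f"Region [{region.region_id}]: {region.description}")
    # Hypothetical instruction text; the real prompt is not given in the abstract.
    lines.append("Write a commonsense question and answer that refers to regions by ID.")
    return "\n".join(lines)


def distill_examples(
    global_description: str,
    regions: List[Region],
    generate: Callable[[str], List[str]],
    critic: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, chosen only for this sketch
) -> List[str]:
    """Sample candidates from an LLM-like callable and keep those the critic accepts."""
    prompt = build_prompt(global_description, regions)
    candidates = generate(prompt)
    return [text for text in candidates if critic(text) >= threshold]


def toy_generate(prompt: str) -> List[str]:
    """Stand-in for an LLM call; returns fixed candidates regardless of the prompt."""
    return [
        "Q: Why is [0] holding an umbrella? A: It is probably raining.",
        "Q: What is happening? A: Something.",
    ]


def toy_critic(text: str) -> float:
    """Stand-in for a trained critic; rewards candidates that cite a region ID."""
    return 1.0 if "[" in text else 0.0


kept = distill_examples(
    "A person walks down a wet street.",
    [Region(0, "a person holding an umbrella")],
    toy_generate,
    toy_critic,
)
```

In a real pipeline, `generate` would wrap an LLM API call and `critic` would be a learned scorer; the retained examples would then serve as training data for the VL model that accepts regions as input.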