Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only for supporting reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify one or more regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description, both automatically generated by a set of VL models. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method yields VL models that reason more precisely than a baseline that passes a generated referring expression to an LLM.
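As a rough illustration of the data-generation loop described above, the following minimal sketch assembles an LLM prompt from a global image description and per-region descriptions, then keeps only the candidates that a critic scores highly. Everything here is an assumption made for illustration: the names `Region`, `build_prompt`, and `filter_with_critic`, the prompt wording, and the 0.5 threshold do not come from the paper, and the critic is modeled as any function that returns a score.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    """A referenced image region (hypothetical structure)."""
    region_id: int
    description: str  # literal region description, e.g. from a captioning model


def build_prompt(global_description: str, regions: List[Region]) -> str:
    """Combine global and local literal descriptions into one LLM prompt.

    The prompt wording is an assumption, not the paper's actual template.
    """
    lines = [f"Image description: {global_description}"]
    for region in regions:
        lines.append(f"Region [{region.region_id}]: {region.description}")
    lines.append(
        "Write a commonsense question and answer that refers to the regions by ID."
    )
    return "\n".join(lines)


def filter_with_critic(
    candidates: List[str],
    critic: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, chosen only for illustration
) -> List[str]:
    """Keep the generated examples whose critic score meets the threshold."""
    return [c for c in candidates if critic(c) >= threshold]
```

In this sketch the prompt would be sent to an LLM of your choice, and the surviving examples would form the localized commonsense corpus used for training.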
