Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only for supporting reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify one or more regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description, both automatically generated by a set of VL models. Using a separately trained critic model to select high-quality examples, we find that training on the localized commonsense corpus can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method yields VL models that reason more precisely than a baseline that passes a generated referring expression to an LLM.
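The data-collection steps described above (combine a global description with region descriptions into an LLM prompt, then keep only examples a critic rates highly) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Region` class, the `build_prompt` and `filter_with_critic` functions, the prompt wording, and the 0.5 threshold are all hypothetical, and the critic is stood in for by any callable that returns a score.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    """Hypothetical container for one user-specified image region."""
    region_id: int
    description: str  # local literal description, e.g. from a region captioner


def build_prompt(global_description: str, regions: List[Region]) -> str:
    """Combine the global and local literal descriptions into one LLM prompt.

    The prompt wording is an assumption made for illustration only.
    """
    lines = [f"Image: {global_description}"]
    for region in regions:
        lines.append(f"Region [{region.region_id}]: {region.description}")
    lines.append(
        "Write a commonsense question and answer that refers to regions by their IDs."
    )
    return "\n".join(lines)


def filter_with_critic(
    candidates: List[str],
    critic: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Keep only the generated examples whose critic score reaches the threshold."""
    return [c for c in candidates if critic(c) >= threshold]
```

In this sketch the filtered examples would form the localized commonsense corpus used for training; the LLM call itself is left out because the abstract does not specify which model or API is used.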

