Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? - 专知论文

会员服务 ·

0

INFORMS · MoDELS · 语言模型化 · 可理解性 · Analysis ·

2023 年 2 月 23 日

Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?

翻译：预训练视觉与语言模型能否回答视觉信息寻求型问题？

Yang Chen,Hexiang Hu,Yi Luan,Haitian Sun,Soravit Changpinyo,Alan Ritter,Ming-Wei Chang

from arxiv, Our dataset and evaluation is available at https://open-vision-language.github.io/infoseek/

Large language models have demonstrated an emergent capability in answering knowledge intensive questions. With recent progress on web-scale visual and language pre-training, do these models also understand how to answer visual information seeking questions? To answer this question, we present InfoSeek, a Visual Question Answering dataset that focuses on asking information-seeking questions, where the information can not be answered by common sense knowledge. We perform a multi-stage human annotation to collect a natural distribution of high-quality visual information seeking question-answer pairs. We also construct a large-scale, automatically collected dataset by combining existing visual entity recognition datasets and Wikidata, which provides over one million examples for model fine-tuning and validation. Based on InfoSeek, we analyzed various pre-trained Visual QA systems to gain insights into the characteristics of different pre-trained models. Our analysis shows that it is challenging for the state-of-the-art multi-modal pre-trained models to answer visual information seeking questions, but this capability is improved through fine-tuning on the automated InfoSeek dataset. We hope our analysis paves the way to understand and develop the next generation of multi-modal pre-training.

翻译：大型语言模型在回答知识密集型问题方面已展现出涌现能力。随着网络规模视觉与语言预训练的最新进展，这些模型是否也能理解如何回答视觉信息寻求型问题？为探究此问题，我们提出InfoSeek——一个专注于信息寻求型问题的视觉问答数据集，其中所涉信息无法通过常识知识回答。我们采用多阶段人工标注，收集了高质量视觉信息寻求型问答对的自然分布，并构建了一个大规模自动采集数据集（融合现有视觉实体识别数据集与维基百科），提供超百万样本用于模型微调与验证。基于InfoSeek，我们分析了多种预训练视觉问答系统的特性，揭示了不同预训练模型的差异。研究表明，当前最先进的多模态预训练模型在回答视觉信息寻求型问题时仍面临挑战，但通过基于自动生成的InfoSeek数据集进行微调，该能力得以提升。我们期望这一分析能为理解与开发下一代多模态预训练模型奠定基础。

0

相关内容

INFORMS

《计算机信息》杂志发表高质量的论文，扩大了运筹学和计算的范围，寻求有关理论、方法、实验、系统和应用方面的原创研究论文、新颖的调查和教程论文，以及描述新的和有用的软件工具的论文。官网链接：https://pubsonline.informs.org/journal/ijoc

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

专知会员服务

28+阅读 · 2019年11月8日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

《DeepGCNs: Making GCNs Go as Deep as CNNs》

《DeepGCNs: Making GCNs Go as Deep as CNNs》

专知会员服务

32+阅读 · 2019年10月17日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

【论文推荐】最新六篇知识图谱相关论文—Zero-shot识别、卷积二维知识图谱、变分知识图谱推理、张量分解、推荐

【论文推荐】最新六篇知识图谱相关论文—Zero-shot识别、卷积二维知识图谱、变分知识图谱推理、张量分解、推荐

专知

50+阅读 · 2018年4月25日

【论文推荐】最新七篇视觉问答（VQA）相关论文—差别注意力机制、视觉问题推理、视觉对话、数据可视化、记忆增强网络、显式推理

【论文推荐】最新七篇视觉问答（VQA）相关论文—差别注意力机制、视觉问题推理、视觉对话、数据可视化、记忆增强网络、显式推理

专知

17+阅读 · 2018年4月19日

【论文推荐】最新7篇视觉问答（VQA）相关论文—解释、读写记忆网络、逆视觉问答、视觉推理、可解释性、注意力机制、计数

【论文推荐】最新7篇视觉问答（VQA）相关论文—解释、读写记忆网络、逆视觉问答、视觉推理、可解释性、注意力机制、计数

专知

30+阅读 · 2018年3月22日

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

机器学习研究会

20+阅读 · 2017年12月17日

活性氧介导的内质网应激在博莱霉素诱发肺上皮-间质转化和肺纤维化中的作用

国家自然科学基金

0+阅读 · 2016年12月31日

Calderon问题和边界刚性问题

国家自然科学基金

0+阅读 · 2013年12月31日

金属磁性材料的拓扑结构与电磁性能

国家自然科学基金

0+阅读 · 2013年12月31日

EAST装置上离子回旋新型长天线的设计与耦合研究

国家自然科学基金

0+阅读 · 2012年12月31日

U2 snRNP在DNA损伤应激反应中的作用及机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

microRNA调节肿瘤抑制因子Caliban应答DNA损伤的机制

国家自然科学基金

1+阅读 · 2012年12月31日

共振Schottky探针研制

国家自然科学基金

0+阅读 · 2012年12月31日

癌/睾丸抗原HCA587对转录因子NF-κB的调节作用研究

国家自然科学基金

0+阅读 · 2011年12月31日

联合携PGE2基因骨髓间充质干细胞移植重塑Kuppfer细胞表型诱导大鼠肝移植术后免疫耐受

国家自然科学基金

0+阅读 · 2011年12月31日

惰性气体与O2,N2,CO,NO、CN相互作用势及碰撞激发截面的同位素效应研究

国家自然科学基金

0+阅读 · 2009年12月31日

ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models

Arxiv

64+阅读 · 2023年3月29日

A Survey of Knowledge-Enhanced Pre-trained Language Models

Arxiv

18+阅读 · 2022年11月17日

K-AID: Enhancing Pre-trained Language Models with Domain Knowledge for Question Answering

Arxiv

15+阅读 · 2021年9月22日

Unifying Vision-and-Language Tasks via Text Generation

Arxiv

10+阅读 · 2021年2月4日

A Hierarchical Reasoning Graph Neural Network for The Automatic Scoring of Answer Transcriptions in Video Job Interviews

A Hierarchical Reasoning Graph Neural Network for The Automatic Scoring of Answer Transcriptions in Video Job Interviews

Arxiv

14+阅读 · 2020年12月22日

UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training

Arxiv

15+阅读 · 2020年2月28日

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Arxiv

19+阅读 · 2020年2月15日

Differentiable Reasoning on Large Knowledge Bases and Natural Language

Arxiv

12+阅读 · 2019年12月17日

VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions

Arxiv

17+阅读 · 2018年3月20日

Towards Understanding and Answering Multi-Sentence Recommendation Questions on Tourism

Arxiv

15+阅读 · 2018年1月5日

VIP会员

文章信息

相关主题

语言模型化

最新内容

ICML 2026 | 边界嵌入塑形：用自适应对比学习破解图结构纠缠

ICML 2026 | 边界嵌入塑形：用自适应对比学习破解图结构纠缠

专知会员服务

4+阅读 · 6月22日

综述 | 3D场景图：开放挑战与未来方向

综述 | 3D场景图：开放挑战与未来方向

专知会员服务

5+阅读 · 6月22日

《国防工业6.0：全自主作战系统、量子-人工智能融合与新一代战略威慑》

《国防工业6.0：全自主作战系统、量子-人工智能融合与新一代战略威慑》

专知会员服务

6+阅读 · 6月22日

21世纪的无人机战争

21世纪的无人机战争

专知会员服务

4+阅读 · 6月22日

《伊朗与以色列-美国热战及其对数字技术的影响》

《伊朗与以色列-美国热战及其对数字技术的影响》

专知会员服务

5+阅读 · 6月22日

《量子技术的军事任务技术适配与利用》

《量子技术的军事任务技术适配与利用》

专知会员服务

5+阅读 · 6月22日

《美国陆军军官学校（西点军校）本科生科研中生成式人工智能的使用》

《美国陆军军官学校（西点军校）本科生科研中生成式人工智能的使用》

专知会员服务

6+阅读 · 6月22日

美国从乌克兰无人机战争中学习经验

美国从乌克兰无人机战争中学习经验

专知会员服务

7+阅读 · 6月21日

ICML 2026 | 面向视觉语言模型的语义鲁棒性认证

ICML 2026 | 面向视觉语言模型的语义鲁棒性认证

专知会员服务

5+阅读 · 6月21日

综述 | 智能体电子设计自动化：从“交接有效性”重新理解Agentic EDA

综述 | 智能体电子设计自动化：从“交接有效性”重新理解Agentic EDA

专知会员服务

8+阅读 · 6月21日

深入解读 Palantir AIP：全球最具争议的人工智能平台究竟如何运作

深入解读 Palantir AIP：全球最具争议的人工智能平台究竟如何运作

专知会员服务

22+阅读 · 6月20日

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

专知会员服务

5+阅读 · 6月19日

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

专知会员服务

8+阅读 · 6月19日

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

专知会员服务

7+阅读 · 6月18日

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

专知会员服务

9+阅读 · 6月18日

相关VIP内容

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

微软发布DialoGPT预训练语言模型，论文与代码 Large-Scale Generative Pre-training for Conversational Response Generation

专知会员服务

28+阅读 · 2019年11月8日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

《DeepGCNs: Making GCNs Go as Deep as CNNs》

《DeepGCNs: Making GCNs Go as Deep as CNNs》

专知会员服务

32+阅读 · 2019年10月17日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

综述 | 3D场景图：开放挑战与未来方向

21世纪的无人机战争

ICML 2026 | 边界嵌入塑形：用自适应对比学习破解图结构纠缠

《国防工业6.0：全自主作战系统、量子-人工智能融合与新一代战略威慑》

相关资讯

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

【论文推荐】最新六篇知识图谱相关论文—Zero-shot识别、卷积二维知识图谱、变分知识图谱推理、张量分解、推荐

【论文推荐】最新六篇知识图谱相关论文—Zero-shot识别、卷积二维知识图谱、变分知识图谱推理、张量分解、推荐

专知

50+阅读 · 2018年4月25日

【论文推荐】最新七篇视觉问答（VQA）相关论文—差别注意力机制、视觉问题推理、视觉对话、数据可视化、记忆增强网络、显式推理

【论文推荐】最新七篇视觉问答（VQA）相关论文—差别注意力机制、视觉问题推理、视觉对话、数据可视化、记忆增强网络、显式推理

专知

17+阅读 · 2018年4月19日

【论文推荐】最新7篇视觉问答（VQA）相关论文—解释、读写记忆网络、逆视觉问答、视觉推理、可解释性、注意力机制、计数

【论文推荐】最新7篇视觉问答（VQA）相关论文—解释、读写记忆网络、逆视觉问答、视觉推理、可解释性、注意力机制、计数

专知

30+阅读 · 2018年3月22日

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

机器学习研究会

20+阅读 · 2017年12月17日

相关论文

ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models

Arxiv

64+阅读 · 2023年3月29日

A Survey of Knowledge-Enhanced Pre-trained Language Models

Arxiv

18+阅读 · 2022年11月17日

K-AID: Enhancing Pre-trained Language Models with Domain Knowledge for Question Answering

Arxiv

15+阅读 · 2021年9月22日

Unifying Vision-and-Language Tasks via Text Generation

Arxiv

10+阅读 · 2021年2月4日

A Hierarchical Reasoning Graph Neural Network for The Automatic Scoring of Answer Transcriptions in Video Job Interviews

A Hierarchical Reasoning Graph Neural Network for The Automatic Scoring of Answer Transcriptions in Video Job Interviews

Arxiv

14+阅读 · 2020年12月22日

UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training

Arxiv

15+阅读 · 2020年2月28日

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Arxiv

19+阅读 · 2020年2月15日

Differentiable Reasoning on Large Knowledge Bases and Natural Language

Arxiv

12+阅读 · 2019年12月17日

VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions

Arxiv

17+阅读 · 2018年3月20日

Towards Understanding and Answering Multi-Sentence Recommendation Questions on Tourism

Arxiv

15+阅读 · 2018年1月5日

相关基金

活性氧介导的内质网应激在博莱霉素诱发肺上皮-间质转化和肺纤维化中的作用

国家自然科学基金

0+阅读 · 2016年12月31日

Calderon问题和边界刚性问题

国家自然科学基金

0+阅读 · 2013年12月31日

金属磁性材料的拓扑结构与电磁性能

国家自然科学基金

0+阅读 · 2013年12月31日

EAST装置上离子回旋新型长天线的设计与耦合研究

国家自然科学基金

0+阅读 · 2012年12月31日

U2 snRNP在DNA损伤应激反应中的作用及机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

microRNA调节肿瘤抑制因子Caliban应答DNA损伤的机制

国家自然科学基金

1+阅读 · 2012年12月31日

共振Schottky探针研制

国家自然科学基金

0+阅读 · 2012年12月31日

癌/睾丸抗原HCA587对转录因子NF-κB的调节作用研究

国家自然科学基金

0+阅读 · 2011年12月31日

联合携PGE2基因骨髓间充质干细胞移植重塑Kuppfer细胞表型诱导大鼠肝移植术后免疫耐受

国家自然科学基金

0+阅读 · 2011年12月31日

惰性气体与O2,N2,CO,NO、CN相互作用势及碰撞激发截面的同位素效应研究

国家自然科学基金

0+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员