Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data

The impressive advances and applications of large language models and joint language-and-visual understanding models have led to an increased need for methods of probing their potential reasoning capabilities. However, the difficulty of gathering naturally occurring data for complex multi-modal reasoning tasks bottlenecks the evaluation of AI methods on tasks that are not already covered by an academic dataset. In this work, we leverage recent advances in high-resolution text-to-image generation to develop a framework for generating evaluation data for multi-modal reasoning tasks. We apply this framework to generate context-dependent anomaly data, creating a synthetic dataset for a challenging task that is not well covered by existing datasets. We benchmark the performance of a state-of-the-art visual question answering (VQA) model against data generated with this method, and demonstrate that while the task is tractable, the model performs significantly worse on the context-dependent anomaly detection task than on standard VQA tasks.
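As a rough illustration of the kind of evaluation loop the abstract describes, the sketch below pairs each object with a typical and an atypical context, writes a text prompt for each, and scores a predictor on a yes/no anomaly question. It is a minimal sketch under stated assumptions, not the authors' framework: the prompt templates, the question wording, the `AnomalyExample` and `build_examples` names, and the `always_no` baseline are all hypothetical, and the image-generation and VQA steps are replaced by a stub that sees only the example record.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class AnomalyExample:
    prompt: str    # text that would be sent to a text-to-image model (assumed template)
    question: str  # question that would be posed to a VQA model about the image
    answer: str    # expected answer: "yes" if the scene is anomalous, otherwise "no"


def build_examples(obj: str, normal_ctx: str, odd_ctx: str) -> List[AnomalyExample]:
    """Create one typical and one anomalous example for the same object."""
    question = f"Is it unusual to see {obj} here?"
    return [
        AnomalyExample(f"a photo of {obj} {normal_ctx}", question, "no"),
        AnomalyExample(f"a photo of {obj} {odd_ctx}", question, "yes"),
    ]


def accuracy(examples: List[AnomalyExample],
             predict: Callable[[AnomalyExample], str]) -> float:
    """Fraction of examples where the predictor's answer matches the label."""
    if not examples:
        return 0.0
    correct = sum(predict(ex).strip().lower() == ex.answer for ex in examples)
    return correct / len(examples)


def always_no(example: AnomalyExample) -> str:
    """Hypothetical baseline standing in for a real VQA model."""
    return "no"


examples = build_examples("a campfire", "in a forest clearing", "in a living room")
examples += build_examples("a lifeguard", "at a beach", "in a library")
score = accuracy(examples, always_no)
print(f"{len(examples)} examples, baseline accuracy = {score:.2f}")
```

Because every object appears once in a typical and once in an atypical context, a predictor that ignores context cannot beat chance on the balanced set, which is the property that makes the anomaly context-dependent. In a real pipeline, `predict` would receive the generated image together with the question rather than the example record.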

