The impressive advances in, and applications of, large language models and joint language-and-vision understanding models have led to an increased need for methods of probing their potential reasoning capabilities. However, the difficulty of gathering naturally occurring data for complex multi-modal reasoning tasks bottlenecks the evaluation of AI methods on tasks that are not already covered by an academic dataset. In this work, we leverage recent advances in high-resolution text-to-image generation to develop a framework for generating evaluation data for multi-modal reasoning tasks. We apply this framework to generate context-dependent anomaly data, creating a synthetic dataset for a challenging task that is not well covered by existing datasets. We benchmark the performance of a state-of-the-art visual question answering (VQA) model on data generated with this method, and demonstrate that while the task is tractable, the model performs significantly worse on the context-dependent anomaly detection task than on standard VQA tasks.