The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the world knowledge required to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web is that the benchmarks used to evaluate these models' abilities may leak into the training data, compromising them. To safeguard against test data contamination and to truly test the abilities of these foundation models, we propose LiveXiv: a scalable, evolving live benchmark based on scientific arXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and automatically generates visual question-answer (VQA) pairs. This is done without any human in the loop, using the multi-modal content in the manuscripts, such as graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark from evaluations of only a subset of models, significantly reducing the overall evaluation cost. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities, free of contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. Comparing its overall results to our automatic annotations, we find that the performance variance is indeed minimal (<2.5%). Our dataset is available online on HuggingFace, and our code will be available here.
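To give intuition for how performance on an evolving benchmark could be estimated from evaluations of only a subset of models, here is a minimal illustrative sketch, not the paper's actual estimator: it assumes correctness factorizes approximately as a per-model ability times a per-question easiness, estimates easiness from a fully evaluated "core" set of models, and predicts a new model's full-benchmark accuracy from a small anchor subset of questions. All names and the rank-1 assumption are hypothetical, for illustration only.

```python
# Hypothetical sketch of subset-based performance estimation (NOT the
# paper's method): assume P(model answers question correctly)
# ≈ ability[model] * easiness[question], a rank-1 structure.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_questions = 8, 200
ability = rng.uniform(0.4, 0.9, n_models)     # latent per-model skill
easiness = rng.uniform(0.6, 1.2, n_questions) # latent per-question factor
p = np.clip(np.outer(ability, easiness), 0.0, 1.0)
scores = (rng.random((n_models, n_questions)) < p).astype(float)

# Fully evaluate a "core" subset of models on every question.
core = scores[:6]
easiness_hat = core.mean(axis=0)  # per-question difficulty estimate

# A held-out model answers only a small anchor subset of questions.
anchor = rng.choice(n_questions, 50, replace=False)
new_scores = scores[6]
ability_hat = new_scores[anchor].mean() / easiness_hat[anchor].mean()

# Predict its accuracy on the full benchmark without running it fully.
pred_acc = np.clip(ability_hat * easiness_hat, 0.0, 1.0).mean()
true_acc = new_scores.mean()
```

In this toy setup, `pred_acc` tracks `true_acc` closely because the simulated scores really are rank-1; real LMM score matrices need richer models (e.g., item-response-theory-style estimators), but the cost saving comes from the same idea of sharing question statistics across models.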