The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the world knowledge required to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web is that the benchmarks used to evaluate these models' abilities may leak into the training data, compromising them. To safeguard against test data contamination and to truly test the abilities of these foundation models, we propose LiveXiv: a scalable, evolving live benchmark based on scientific arXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and automatically generates visual question-answer (VQA) pairs. This is done without any human in the loop, using the multi-modal content in the manuscripts, such as graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark from evaluations of only a subset of models, significantly reducing the overall evaluation cost. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities, free of contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. Comparing its overall results to our automatic annotations, we find that the performance variance is indeed minimal (<2.5%). Our dataset is available online on HuggingFace, and our code will be available here.
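To give intuition for how performance on an evolving benchmark could be estimated from evaluations of only a subset of models, here is a minimal illustrative sketch, not the paper's actual estimator: it assumes correctness factorizes approximately as a per-model ability times a per-question easiness, estimates easiness from a fully evaluated "core" set of models, and predicts a new model's full-benchmark accuracy from a small anchor subset of questions. All names and the rank-1 assumption are hypothetical, for illustration only.

```python
# Hypothetical sketch of subset-based performance estimation (NOT the
# paper's method): assume P(model answers question correctly)
# ≈ ability[model] * easiness[question], a rank-1 structure.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_questions = 8, 200
ability = rng.uniform(0.4, 0.9, n_models)     # latent per-model skill
easiness = rng.uniform(0.6, 1.2, n_questions) # latent per-question factor
p = np.clip(np.outer(ability, easiness), 0.0, 1.0)
scores = (rng.random((n_models, n_questions)) < p).astype(float)

# Fully evaluate a "core" subset of models on every question.
core = scores[:6]
easiness_hat = core.mean(axis=0)  # per-question difficulty estimate

# A held-out model answers only a small anchor subset of questions.
anchor = rng.choice(n_questions, 50, replace=False)
new_scores = scores[6]
ability_hat = new_scores[anchor].mean() / easiness_hat[anchor].mean()

# Predict its accuracy on the full benchmark without running it fully.
pred_acc = np.clip(ability_hat * easiness_hat, 0.0, 1.0).mean()
true_acc = new_scores.mean()
```

In this toy setup, `pred_acc` tracks `true_acc` closely because the simulated scores really are rank-1; real LMM score matrices need richer models (e.g., item-response-theory-style estimators), but the cost saving comes from the same idea of sharing question statistics across models.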