Recent years have witnessed remarkable progress in the development of large vision-language models (LVLMs). Benefiting from strong language backbones and efficient cross-modal alignment strategies, LVLMs exhibit surprising capabilities to perceive visual signals and perform visually grounded reasoning. However, the capabilities of LVLMs have not been comprehensively and quantitatively evaluated. Most existing multi-modal benchmarks require task-oriented input-output formats, which makes it difficult to automatically assess the free-form text output of LVLMs. To effectively leverage the annotations available in existing benchmarks and reduce the manual effort required to construct new ones, we propose to re-formulate existing benchmarks into unified LVLM-compatible formats. Through systematic data collection and reformulation, we present the ReForm-Eval benchmark, offering substantial data for evaluating various capabilities of LVLMs. Based on ReForm-Eval, we conduct extensive experiments, thoroughly analyze the strengths and weaknesses of existing LVLMs, and identify the factors underlying them. Our benchmark and evaluation framework will be open-sourced as a cornerstone for advancing the development of LVLMs.