Tiny LVLM-eHub: Early Multimodal Experiments with Bard

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated significant progress in tackling complex multimodal tasks. Among these developments, Google's Bard stands out for its remarkable multimodal capabilities, supporting comprehensive understanding and reasoning across various domains. This work presents an early and holistic evaluation of LVLMs' multimodal abilities, with a particular focus on Bard, by proposing a lightweight variant of LVLM-eHub, named Tiny LVLM-eHub. Compared with the original LVLM-eHub, Tiny LVLM-eHub has several appealing properties. First, it provides a systematic assessment of six categories of multimodal capabilities, namely visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence, through quantitative evaluation on $42$ standard text-related visual benchmarks. Second, it conducts an in-depth analysis of LVLMs' predictions using the ChatGPT Ensemble Evaluation (CEE), which yields a robust and accurate evaluation and aligns better with human evaluation than the word-matching approach. Third, it comprises only $2.1$K image-text pairs, making it easy for practitioners to evaluate their own offline LVLMs. Through extensive experimental analysis, this study demonstrates that Bard outperforms previous LVLMs in most multimodal capabilities except object hallucination, to which Bard is still susceptible. Tiny LVLM-eHub serves as a baseline evaluation for various LVLMs and encourages innovative strategies aimed at advancing multimodal techniques. Our project is publicly available at \url{https://github.com/OpenGVLab/Multi-Modality-Arena}.
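To make the contrast between word matching and an ensemble-style evaluation concrete, the following is a minimal sketch, not the project's implementation. It assumes that an ensemble evaluation can be modeled as several independent judges whose verdicts are combined by majority vote; the abstract does not specify the aggregation rule, so that choice is an assumption. All function names are hypothetical, and the three toy judges are simple rule-based stand-ins for what would in practice be ChatGPT calls with different prompts.

```python
from typing import Callable, List

# A judge takes (question, prediction, answer) and returns a verdict.
# Hypothetical type: in a real setup each judge would wrap an LLM call.
Judge = Callable[[str, str, str], bool]

NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}
DIGIT_WORDS = {digit: word for word, digit in NUMBER_WORDS.items()}


def word_match(prediction: str, answer: str) -> bool:
    """Baseline: correct if the reference answer appears verbatim in the prediction."""
    return answer.strip().lower() in prediction.strip().lower()


def judge_exact(question: str, prediction: str, answer: str) -> bool:
    """Toy judge 1: plain word matching."""
    return word_match(prediction, answer)


def judge_prediction_digits(question: str, prediction: str, answer: str) -> bool:
    """Toy judge 2: rewrite number words in the prediction as digits, then match."""
    tokens = prediction.lower().split()
    normalized = " ".join(NUMBER_WORDS.get(token, token) for token in tokens)
    return word_match(normalized, answer)


def judge_answer_words(question: str, prediction: str, answer: str) -> bool:
    """Toy judge 3: rewrite a digit answer as a number word, then match."""
    spelled = DIGIT_WORDS.get(answer.strip(), answer)
    return word_match(prediction, spelled)


def ensemble_vote(verdicts: List[bool]) -> bool:
    """Majority vote over verdicts (assumed aggregation rule); ties count as incorrect."""
    positives = sum(1 for verdict in verdicts if verdict)
    return positives > len(verdicts) - positives


def ensemble_evaluate(question: str, prediction: str, answer: str,
                      judges: List[Judge]) -> bool:
    """Collect one verdict per judge and combine them by majority vote."""
    return ensemble_vote([judge(question, prediction, answer) for judge in judges])


if __name__ == "__main__":
    judges = [judge_exact, judge_prediction_digits, judge_answer_words]
    question = "How many dogs are in the image?"
    prediction = "There are two dogs."
    answer = "2"
    print("word matching:", word_match(prediction, answer))
    print("ensemble vote:", ensemble_evaluate(question, prediction, answer, judges))
```

In this toy case, word matching rejects a free-form prediction that states the right count in words, while two of the three judges accept it, so the vote marks it correct. This illustrates the kind of mismatch with human judgment that a more flexible evaluator is meant to reduce.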
