Vision Language Models (VLMs) have evolved rapidly, driving significant advances in multimodal understanding tasks. However, most of these models are trained and evaluated on English-centric datasets, leaving a gap in the development and evaluation of VLMs for other languages, such as Japanese. This gap can be attributed to the lack of methodologies for constructing such VLMs and the absence of benchmarks that accurately measure their performance. To address this issue, we introduce a novel benchmark, Japanese Heron-Bench, for evaluating the Japanese capabilities of VLMs. The Japanese Heron-Bench consists of a variety of image-question-answer pairs tailored to the Japanese context. Additionally, we present a baseline Japanese VLM trained on Japanese visual instruction tuning datasets. Our Heron-Bench reveals the strengths and limitations of the proposed VLM across various ability dimensions. Furthermore, we clarify the capability gap between strong closed models such as GPT-4V and the baseline model, providing valuable insights for future research in this domain. We release the benchmark dataset and training code to facilitate further development of Japanese VLM research.