To develop high-performing Vision-Language Models (VLMs), it is essential to prepare multimodal resources such as image-text pairs, interleaved data, and instruction data. While multimodal resources in English are abundant, there is a significant lack of corresponding resources for non-English languages such as Japanese. To address this problem, we take Japanese as a case study of a non-English language and propose a method for rapidly creating Japanese multimodal datasets from scratch. We collect Japanese image-text pairs and interleaved data from web archives and generate Japanese instruction data directly from images using an existing VLM. Our experimental results show that a VLM trained on these native datasets outperforms those relying on machine-translated content.