In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance and reliable results across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we open-source several Vietnamese vision question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.