In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance and reliable results across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.