Smaller vision-language models (VLMs) are becoming increasingly important for privacy-focused, on-device applications due to their ability to run efficiently on consumer hardware for processing enterprise commercial documents and images. These models require strong language understanding and visual capabilities to enhance human-machine interaction. To address this need, we present H2OVL-Mississippi, a pair of small VLMs trained on 37 million image-text pairs using 240 hours of compute on 8×H100 GPUs. H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition, achieving state-of-the-art performance on the Text Recognition portion of OCRBench and surpassing much larger models in this area. Additionally, we release H2OVL-Mississippi-2B, a 2 billion parameter model for general use cases, which exhibits highly competitive metrics across various academic benchmarks. Both models build upon our prior work with the H2O-Danube language models, extending their capabilities into the visual domain. We release both models under the Apache 2.0 license, making VLMs accessible to everyone and democratizing document AI and visual LLMs.