Smaller vision-language models (VLMs) are becoming increasingly important for privacy-focused, on-device applications due to their ability to run efficiently on consumer hardware for processing enterprise commercial documents and images. These models require strong language understanding and visual capabilities to enhance human-machine interaction. To address this need, we present H2OVL-Mississippi, a pair of small VLMs trained on 37 million image-text pairs using 240 hours of compute on 8×H100 GPUs. H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition, achieving state-of-the-art performance on the Text Recognition portion of OCRBench and surpassing much larger models in this area. Additionally, we release H2OVL-Mississippi-2B, a 2 billion parameter model for general use cases, which exhibits highly competitive metrics across various academic benchmarks. Both models build upon our prior work with the H2O-Danube language models, extending their capabilities into the visual domain. We release both models under the Apache 2.0 license, making VLMs accessible to everyone and democratizing document AI and visual LLMs.