Reading dense text and locating objects within images are fundamental abilities for Large Vision-Language Models (LVLMs) engaged in advanced tasks. Previous LVLMs, including leading proprietary models such as GPT-4o, have struggled to excel at both tasks simultaneously. Moreover, previous LVLMs with fine-grained perception consume thousands of tokens per image, making them resource-intensive. We present TextHawk2, a bilingual LVLM featuring efficient fine-grained perception and demonstrating cutting-edge performance across general-purpose, OCR, and grounding tasks with 16 times fewer image tokens. Key improvements include: (1) Token Compression: Building on the efficient architecture of its predecessor, TextHawk2 reduces the number of tokens per image by a factor of 16, enabling the TextHawk series to be trained and deployed with minimal resources. (2) Visual Encoder Reinforcement: We enhance the visual encoder through LVLM co-training, unlocking its potential for previously unseen tasks such as Chinese OCR and grounding. (3) Data Diversity: We maintain a comparable pre-training scale of 100 million samples while diversifying the data sources. We evaluate TextHawk2 across multiple benchmarks, where it consistently delivers superior performance and outperforms closed-source models of similar scale, achieving, for example, 78.4% accuracy on OCRBench, 81.4% accuracy on ChartQA, 89.6% ANLS on DocVQA, and 88.1% accuracy@0.5 on RefCOCOg-test.
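To make the 16x token-compression claim concrete, the sketch below shows one plausible realization: a small set of learnable queries cross-attending to the full-resolution visual tokens, producing a sequence 16 times shorter. This is a minimal illustration under that assumption only; the `TokenResampler` name, dimensions, and layer choices are hypothetical and do not reproduce the actual TextHawk2 module.

```python
# Minimal sketch of query-based visual token compression (assumed mechanism,
# not the actual TextHawk2 implementation).
import torch
import torch.nn as nn


class TokenResampler(nn.Module):
    """Compress N visual tokens to N // ratio tokens via cross-attention
    from a shorter sequence of queries (hypothetical module)."""

    def __init__(self, dim: int = 1024, num_heads: int = 8, ratio: int = 16):
        super().__init__()
        self.ratio = ratio
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
        # tokens:  (B, N, dim)          full-resolution visual tokens
        # queries: (B, N // ratio, dim) compressed query sequence
        kv = self.norm_kv(tokens)
        out, _ = self.attn(self.norm_q(queries), kv, kv)
        return out + queries  # residual connection on the query stream


# Example: 4096 patch tokens -> 256 compressed tokens (16x reduction).
B, N, dim = 1, 4096, 1024
resampler = TokenResampler(dim=dim, ratio=16)
# In a real model the queries would be learnable parameters; random here.
queries = torch.randn(B, N // 16, dim)
tokens = torch.randn(B, N, dim)
compressed = resampler(tokens, queries)
print(compressed.shape)  # torch.Size([1, 256, 1024])
```

With this design, downstream LLM cost scales with the compressed length (256 tokens per image in the example) rather than the raw patch count, which is what makes fine-grained perception affordable at scale.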